Picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.

Command Line Overview

This page provides detailed documentation on the Picard command-line syntax and standard options as well as a complete list of tools, usage recommendations, options, and example commands.

See the Quick Start documentation for instructions on how to obtain and set up the tools if you haven't already.

Command Syntax

The tools are invoked as follows:

java jvm-args -jar picard.jar PicardToolName \
	OPTION1=value1 \
	OPTION2=value2

The backslash at the end of each line except the last lets the command span multiple lines. The command above could also be written on a single line; however, in the documentation we usually format commands across multiple lines for clarity.

jvm-args are arguments that are specific to Java, not Picard. We do not provide guidance for specifying JVM (Java Virtual Machine) arguments except for memory: most of the commands are designed to run in 2 GB of JVM memory, so we recommend using the JVM argument -Xmx2g.

PicardToolName refers to the name of the tool you want to run. It must always be the first argument after the jar file path. Some examples include: CollectAlignmentSummaryMetrics, BuildBamIndex, and CreateSequenceDictionary. The tools and their respective options are described in detail below.

OPTIONs can be any of the Standard Options listed first below, and/or any of the tool-specific options listed for each tool. Argument values can be filenames, strings (of alphanumeric characters), enumerated values, floats (decimal numbers), integers, or booleans (true/false). Options frequently have a value that is set by default but can be modified by specifying a different value in the command line.

For example, INPUT=input.bam specifies a file, where INPUT is the OPTION and input.bam is the value indicating what file to use. As another example, STRANDSPECIFICITY=NONE specifies the argument STRANDSPECIFICITY for which the documentation enumerates acceptable values: NONE, FIRST_READ_TRANSCRIPTION_STRAND, and SECOND_READ_TRANSCRIPTION_STRAND.
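As a rough sketch of how the pieces of the syntax fit together, the shell function below assembles a command line from JVM arguments, a tool name, and OPTION=value pairs, and prints it instead of running it. The names picard_cmd, PICARD_JAR, and JVM_ARGS are hypothetical helpers introduced for illustration; they are not part of Picard.

```shell
# Hypothetical dry-run helper: prints a Picard command instead of executing it.
PICARD_JAR="picard.jar"   # assumed path to the jar file
JVM_ARGS="-Xmx2g"         # memory setting recommended in this section

picard_cmd() {
    tool="$1"   # PicardToolName: always the first argument after the jar path
    shift
    printf 'java %s -jar %s %s' "$JVM_ARGS" "$PICARD_JAR" "$tool"
    for opt in "$@"; do
        printf ' %s' "$opt"   # each remaining argument is one OPTION=value pair
    done
    printf '\n'
}

picard_cmd CollectAlignmentSummaryMetrics INPUT=input.bam OUTPUT=output.txt
```

Printing the command first makes it easy to check the tool name and option spelling before launching a long-running job.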

Usage Example

Here's a typical example where we run CollectAlignmentSummaryMetrics, a quality control tool that takes a reference file and an input file containing sequencing data, and outputs some quality control metrics.

java -jar picard.jar CollectAlignmentSummaryMetrics \
	REFERENCE=my_data/reference.fasta \
	INPUT=my_data/input.bam \
	OUTPUT=results/output.txt

Note that values for arguments that specify files must include appropriate file paths, either absolute or relative to the working directory. In this example we are reading in input data from a subdirectory called my_data and writing outputs to another subdirectory called results.
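Because a mistyped path is a common cause of failed runs, a small pre-flight check can be useful. The sketch below is one possible approach, not something Picard provides: require_readable is a hypothetical helper, and creating the results directory beforehand is a precaution we assume is harmless rather than a documented requirement.

```shell
# Hypothetical helper: fail early if any listed input file is missing or unreadable.
require_readable() {
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "Cannot read input file: $f" >&2
            return 1
        fi
    done
}

# Paths mirror the usage example above; adjust them to your own layout.
#   require_readable my_data/reference.fasta my_data/input.bam && mkdir -p results
```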

Standard Options

The following standard options are relevant to most Picard tools:

Option	Description
--help	Displays options specific to this tool.
--stdhelp	Displays options specific to this tool AND options common to all Picard command line tools.
--version	Displays program version.
TMP_DIR (File)	Default value: null. This option may be specified 0 or more times.
VERBOSITY (LogLevel)	Control verbosity of logging. Default value: INFO. This option can be set to 'null' to clear the default value. Possible values: {ERROR, WARNING, INFO, DEBUG}
QUIET (Boolean)	Whether to suppress job-summary info on System.err. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
VALIDATION_STRINGENCY (ValidationStringency)	Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT. This option can be set to 'null' to clear the default value. Possible values: {STRICT, LENIENT, SILENT}
COMPRESSION_LEVEL (Integer)	Compression level for all compressed files created (e.g. BAM and GELI). Default value: 5. This option can be set to 'null' to clear the default value.
MAX_RECORDS_IN_RAM (Integer)	When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed. Default value: 500000. This option can be set to 'null' to clear the default value.
CREATE_INDEX (Boolean)	Whether to create a BAM index when writing a coordinate-sorted BAM file. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
CREATE_MD5_FILE (Boolean)	Whether to create an MD5 digest for any BAM or FASTQ files created. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
REFERENCE_SEQUENCE (File)	Reference sequence file. Default value: null.
GA4GH_CLIENT_SECRETS (String)	Google Genomics API client_secrets.json file path. Default value: client_secrets.json. This option can be set to 'null' to clear the default value.
USE_JDK_DEFLATER (Boolean)	Use the JDK Deflater instead of the Intel Deflater for writing compressed output. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
USE_JDK_INFLATER (Boolean)	Use the JDK Inflater instead of the Intel Inflater for reading compressed input. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

Tool-Specific Documentation

Below, you will find detailed documentation of all the options that are specific to each tool. Keep in mind that some tools may require one or more of the standard options listed above; this is usually specified in the tool description. For the many tools that collect quality control metrics, documentation of what the metrics mean and how they are calculated is provided on the Picard Metrics Definitions page.

Be sure to consult the Frequently Asked Questions (FAQ) page for useful information about typical problems you may encounter while using Picard tools.

AddCommentsToBam

Adds comments to the header of a BAM file. This tool makes a copy of the input BAM file, with a modified header that includes the comments specified at the command line (prefixed by @CO). Use double quotes to wrap comments that include whitespace or special characters.

Note that this tool cannot be run on SAM files.

Usage example:

java -jar picard.jar AddCommentsToBam \
I=input.bam \
O=modified_bam.bam \
C=comment_1 \
C="comment 2"

Option	Description
INPUT (File)	Input BAM file to add a comment to the header. Required.
OUTPUT (File)	Output BAM file to write results. Required.
COMMENT (String)	Comments to add to the BAM file. Default value: null. This option may be specified 0 or more times.

AddOrReplaceReadGroups

Replace read groups in a BAM file. This tool enables the user to replace all read groups in the INPUT file with a single new read group and assign all reads to this read group in the OUTPUT BAM file.

For more information about read groups, see the GATK Dictionary entry.

This tool accepts INPUT BAM and SAM files or URLs from the Global Alliance for Genomics and Health (GA4GH) (see http://ga4gh.org/#/documentation).

Usage example:

java -jar picard.jar AddOrReplaceReadGroups \
I=input.bam \
O=output.bam \
RGID=4 \
RGLB=lib1 \
RGPL=illumina \
RGPU=unit1 \
RGSM=20

Option	Description
INPUT (String)	Input file (BAM or SAM or a GA4GH url). Required.
OUTPUT (File)	Output file (BAM or SAM). Required.
SORT_ORDER (SortOrder)	Optional sort order to output in. If not supplied OUTPUT is in the same order as INPUT. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate, unknown}
RGID (String)	Read Group ID. Default value: 1. This option can be set to 'null' to clear the default value.
RGLB (String)	Read Group library. Required.
RGPL (String)	Read Group platform (e.g. illumina, solid). Required.
RGPU (String)	Read Group platform unit (e.g. run barcode). Required.
RGSM (String)	Read Group sample name. Required.
RGCN (String)	Read Group sequencing center name. Default value: null.
RGDS (String)	Read Group description. Default value: null.
RGDT (Iso8601Date)	Read Group run date. Default value: null.
RGPI (Integer)	Read Group predicted insert size. Default value: null.
RGPG (String)	Read Group program group. Default value: null.
RGPM (String)	Read Group platform model. Default value: null.

BaitDesigner

Designs oligonucleotide baits for hybrid selection reactions.

This tool is used to design custom bait sets for hybrid selection experiments. The following files are input into BaitDesigner: a (TARGETS) interval list indicating the sequences of interest, e.g. exons with their respective coordinates, a reference sequence, and a unique identifier string (DESIGN_NAME).

The tool will output interval_list files of both bait and target sequences as well as the actual bait sequences in FastA format. At least two baits are output for each target sequence, with greater numbers for larger intervals. Although the default values for both bait size (120 bases) and offsets (80 bases) are suitable for most applications, these values can be customized. Offsets represent the distance between sequential baits on a contiguous stretch of target DNA sequence.

The tool will also output a pooled set of 55,000 (default) oligonucleotides representing all of the baits redundantly. This redundancy achieves a uniform concentration of oligonucleotides for synthesis by a vendor as well as equal numbers of each bait to prevent bias during the hybrid selection reaction.

Usage example:

java -jar picard.jar BaitDesigner \
TARGETS=targets.interval_list \
DESIGN_NAME=new_baits \
R=reference_sequence.fasta

Option	Description
TARGETS (File)	The file with design parameters and targets. Required.
DESIGN_NAME (String)	The name of the bait design. Required.
REFERENCE_SEQUENCE (File)	The reference sequence fasta file. Required.
LEFT_PRIMER (String)	The left amplification primer to prepend to all baits for synthesis. Default value: ATCGCACCAGCGTGT. This option can be set to 'null' to clear the default value.
RIGHT_PRIMER (String)	The right amplification primer to prepend to all baits for synthesis. Default value: CACTGCGGCTCCTCA. This option can be set to 'null' to clear the default value.
DESIGN_STRATEGY (DesignStrategy)	The design strategy to use to layout baits across each target. Default value: FixedOffset. This option can be set to 'null' to clear the default value. Possible values: {CenteredConstrained, FixedOffset, Simple}
BAIT_SIZE (Integer)	The length of each individual bait to design. Default value: 120. This option can be set to 'null' to clear the default value.
MINIMUM_BAITS_PER_TARGET (Integer)	The minimum number of baits to design per target. Default value: 2. This option can be set to 'null' to clear the default value.
BAIT_OFFSET (Integer)	The desired offset between the start of one bait and the start of another bait for the same target. Default value: 80. This option can be set to 'null' to clear the default value.
PADDING (Integer)	Pad the input targets by this amount when designing baits. Padding is applied on both sides in this amount. Default value: 0. This option can be set to 'null' to clear the default value.
REPEAT_TOLERANCE (Integer)	Baits that have more than REPEAT_TOLERANCE soft or hard masked bases will not be allowed. Default value: 50. This option can be set to 'null' to clear the default value.
POOL_SIZE (Integer)	The size of pools or arrays for synthesis. If no pool files are desired, can be set to 0. Default value: 55000. This option can be set to 'null' to clear the default value.
FILL_POOLS (Boolean)	If true, fill up the pools with alternating fwd and rc copies of all baits. Equal copies of all baits will always be maintained. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
DESIGN_ON_TARGET_STRAND (Boolean)	If true design baits on the strand of the target feature, if false always design on the + strand of the genome. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MERGE_NEARBY_TARGETS (Boolean)	If true merge targets that are 'close enough' that designing against a merged target would be more efficient. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
OUTPUT_AGILENT_FILES (Boolean)	If true also output .design.txt files per pool with one line per bait sequence. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
OUTPUT_DIRECTORY (File)	The output directory. If not provided then the DESIGN_NAME will be used as the output directory. Default value: null.

BamToBfq

Creates BFQ files from a BAM file for use by the Maq aligner. BFQ is a binary version of the FASTQ file format.

Usage example:

java -jar picard.jar BamToBfq \
I=input.bam \
ANALYSIS_DIR=analysis_dir \
OUTPUT_FILE_PREFIX=output_file_1 \
PAIRED_RUN=false

Option	Description
INPUT (File)	The BAM file to parse. Required.
ANALYSIS_DIR (File)	The analysis directory for the binary output file. Required.
FLOWCELL_BARCODE (String)	Flowcell barcode (e.g. 30PYMAAXX). Required. Cannot be used in conjunction with option(s) OUTPUT_FILE_PREFIX
LANE (Integer)	Lane number. Default value: null. Cannot be used in conjunction with option(s) OUTPUT_FILE_PREFIX
OUTPUT_FILE_PREFIX (String)	Prefix for all output files. Required. Cannot be used in conjunction with option(s) FLOWCELL_BARCODE (F) LANE (L)
READS_TO_ALIGN (Integer)	Number of reads to align (null = all). Default value: null.
READ_CHUNK_SIZE (Integer)	Number of reads to break into individual groups for alignment. Default value: 2000000. This option can be set to 'null' to clear the default value.
PAIRED_RUN (Boolean)	Whether this is a paired-end run. Required. Possible values: {true, false}
RUN_BARCODE (String)	Deprecated option; use READ_NAME_PREFIX instead. Default value: null. Cannot be used in conjunction with option(s) READ_NAME_PREFIX
READ_NAME_PREFIX (String)	Prefix to be stripped off the beginning of all read names (to make them short enough to run in Maq). Default value: null.
INCLUDE_NON_PF_READS (Boolean)	Whether to include non-PF reads. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
CLIP_ADAPTERS (Boolean)	Whether to clip adapters from the reads. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
BASES_TO_WRITE (Integer)	The number of bases from each read to write to the bfq file. If this is non-null, then only the first BASES_TO_WRITE bases from each read will be written. Default value: null.

BamIndexStats

Generate index statistics from a BAM file. This tool calculates statistics from a BAM index (.bai) file, emulating the behavior of the "samtools idxstats" command. The statistics collected include counts of aligned and unaligned reads as well as all records with no start coordinate. The input to the tool is the BAM file name, but the BAM file must be accompanied by a corresponding index file.

Usage example:

java -jar picard.jar BamIndexStats \
I=input.bam \
O=output

Option	Description
INPUT (File)	A BAM file to process. Required.

BedToIntervalList

Converts a BED file to a Picard Interval List. This tool provides easy conversion from BED to the Picard interval_list format which is required by many Picard processing tools. Note that the coordinate system of BED files is such that the first base or position in a sequence is numbered "0", while in interval_list files it is numbered "1".

BED files contain sequence data displayed in a flexible format that includes nine optional fields, in addition to three required fields within the annotation tracks. The required fields of a BED file include:

chrom - The name of the chromosome (e.g. chr20) or scaffold (e.g. scaffold10671)
chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered "0"
chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

In each annotation track, the number of fields per line must be consistent throughout a data set. For additional information regarding BED files and the annotation field options, please see: http://genome.ucsc.edu/FAQ/FAQformat.html#format1.

Interval_list files contain sequence data distributed into intervals. The interval_list file format is relatively simple and reflects the SAM alignment format to a degree. A SAM style header must be present in the file that lists the sequence records against which the intervals are described. After the header, the file then contains records, one per line in plain text format with the following values tab-separated:
- Sequence name (SN) - The name of the sequence in the file for identification purposes, can be chromosome number e.g. chr20
- Start position - Interval start position (1-based)
- End position - Interval end position (1-based, end inclusive)
- Strand - Indicates +/- strand for the interval (either + or -)
- Interval name - (Each interval should have a unique name)
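The coordinate difference between the two formats can be illustrated with a short awk sketch. This shows only the arithmetic (start shifts by one, end is unchanged) and is not a substitute for BedToIntervalList: it writes no SAM-style header, and the function name bed_to_interval_records, the fixed "+" strand, and the generated interval names are assumptions made for illustration.

```shell
# Sketch of the coordinate shift only; not a replacement for BedToIntervalList.
# Reads 3-column, tab-separated BED records on stdin and prints interval records.
# Assumptions: every record gets strand "+" and a generated name (interval_N).
bed_to_interval_records() {
    awk 'BEGIN { FS = OFS = "\t" }
         { print $1, $2 + 1, $3, "+", "interval_" NR }'
}

# chromStart=0, chromEnd=100 (bases 0-99) becomes start=1, end=100.
printf 'chr20\t0\t100\n' | bed_to_interval_records
```

The end coordinate stays the same because BED's exclusive, 0-based end and the interval_list's inclusive, 1-based end refer to the same base.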

This tool requires a sequence dictionary, provided with the SEQUENCE_DICTIONARY or SD argument. The value given to this argument can be any of the following:
- A file with .dict extension generated using Picard's CreateSequenceDictionary tool
- A reference.fa or reference.fasta file with a reference.dict in the same directory
- Another IntervalList with @SQ lines in the header from which to generate a dictionary
- A VCF that contains ##contig lines from which to generate a sequence dictionary
- A SAM or BAM file with @SQ lines in the header from which to generate a dictionary

Usage example:

java -jar picard.jar BedToIntervalList \
I=input.bed \
O=list.interval_list \
SD=reference_sequence.dict



Option	Description
INPUT (File)	The input BED file. Required.
SEQUENCE_DICTIONARY (File)	The sequence dictionary, or BAM/VCF/IntervalList from which a dictionary can be extracted. Required.
OUTPUT (File)	The output Picard Interval List. Required.
SORT (Boolean)	If true, sort the output interval list before writing it. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
UNIQUE (Boolean)	If true, unique the output interval list by merging overlapping regions, before writing it (implies sort=true). Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

BuildBamIndex

Generates a BAM index ".bai" file. This tool creates an index file for the input BAM that allows fast look-up of data in a BAM file, like an index on a database. Note that this tool cannot be run on SAM files, and that the input BAM file must be sorted in coordinate order.

Usage example:

java -jar picard.jar BuildBamIndex \
I=input.bam
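Since the input must be coordinate-sorted, it can help to inspect the header before indexing. The sketch below checks whether the @HD line of a SAM header declares SO:coordinate. The function name is_coordinate_sorted is hypothetical, and the check trusts the header's declaration rather than verifying the actual record order.

```shell
# Hypothetical helper: reads a SAM header on stdin and succeeds only if the
# @HD line carries the tag SO:coordinate.
is_coordinate_sorted() {
    awk -F '\t' '
        $1 == "@HD" { for (i = 2; i <= NF; i++) if ($i == "SO:coordinate") found = 1 }
        END { exit(found ? 0 : 1) }'
}

# Typical use, assuming samtools is installed:
#   samtools view -H input.bam | is_coordinate_sorted && echo "coordinate-sorted"
```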

Option	Description
INPUT (String)	A BAM file or GA4GH URL to process. Must be sorted in coordinate order. Required.
OUTPUT (File)	The BAM index file. Defaults to x.bai if INPUT is x.bam, otherwise INPUT.bai. If INPUT is a URL and OUTPUT is unspecified, defaults to a file in the current directory. Default value: null.

CalculateReadGroupChecksum

Creates a hash code based on the read groups (RG). This tool creates a hash code based on identifying information in the read groups (RG) of a SAM or BAM file header. Addition or removal of RGs changes the hash code, enabling the user to quickly determine if changes have been made to the read group information.

Usage example:

java -jar picard.jar CalculateReadGroupChecksum \
I=input.bam

Please see the AddOrReplaceReadGroups tool documentation for information regarding the addition, subtraction, or merging of read groups.
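To illustrate the general idea of a read group checksum, the sketch below hashes the @RG lines of a SAM header with the POSIX cksum utility. This is a stand-in technique chosen for illustration and does not reproduce the hash that CalculateReadGroupChecksum computes; the function name rg_checksum and the example headers are hypothetical.

```shell
# Illustration only: Picard's own hash is not reproduced here.
# Reads a SAM header on stdin and prints a checksum of its sorted @RG lines.
rg_checksum() {
    grep '^@RG' | sort | cksum | cut -d ' ' -f 1
}

header_one=$(printf '@HD\tVN:1.6\n@RG\tID:1\tSM:sample1\n')
header_two=$(printf '@HD\tVN:1.6\n@RG\tID:1\tSM:sample1\n@RG\tID:2\tSM:sample1\n')

printf '%s\n' "$header_one" | rg_checksum
printf '%s\n' "$header_two" | rg_checksum   # differs: a read group was added
```

Lines other than @RG do not affect the result, so two files with identical read groups but different @HD or @SQ lines yield the same value in this sketch.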

Option	Description
INPUT (File)	The input SAM or BAM file. Required.
OUTPUT (File)	The file to which the hash code should be written. Default value: null.

CleanSam

Cleans the provided SAM/BAM file, soft-clipping beyond-end-of-reference alignments and setting MAPQ to 0 for unmapped reads.

Option	Description
INPUT (File)	Input SAM to be cleaned. Required.
OUTPUT (File)	Where to write cleaned SAM. Required.

CollectAlignmentSummaryMetrics

Produces a summary of alignment metrics from a SAM or BAM file. This tool takes a SAM/BAM file input and produces metrics detailing the quality of the read alignments as well as the proportion of the reads that passed machine signal-to-noise threshold quality filters. Note that these quality filters are specific to Illumina data; for additional information, please see the corresponding GATK Dictionary entry.

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage example:

java -jar picard.jar CollectAlignmentSummaryMetrics \
R=reference_sequence.fasta \
I=input.bam \
O=output.txt

Please see the CollectAlignmentSummaryMetrics definitions for a complete description of the metrics produced by this tool.


Option	Description
MAX_INSERT_SIZE (Integer)	Paired-end reads above this insert size will be considered chimeric along with inter-chromosomal pairs. Default value: 100000. This option can be set to 'null' to clear the default value.
EXPECTED_PAIR_ORIENTATIONS (PairOrientation)	Paired-end reads that do not have this expected orientation will be considered chimeric. Default value: [FR]. This option can be set to 'null' to clear the default value. Possible values: {FR, RF, TANDEM} This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
ADAPTER_SEQUENCE (String)	List of adapter sequences to use when processing the alignment metrics. Default value: [AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG]. This option can be set to 'null' to clear the default value. This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
METRIC_ACCUMULATION_LEVEL (MetricAccumulationLevel)	The level(s) at which to accumulate metrics. Default value: [ALL_READS]. This option can be set to 'null' to clear the default value. Possible values: {ALL_READS, SAMPLE, LIBRARY, READ_GROUP} This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
IS_BISULFITE_SEQUENCED (Boolean)	Whether the SAM or BAM file consists of bisulfite sequenced reads. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
REFERENCE_SEQUENCE (File)	Reference sequence file. Note that while this argument isn't required, without it only a small subset of the metrics will be calculated. Note also that if a reference sequence is provided, it must be accompanied by a sequence dictionary. Default value: null.
INPUT (File)	Input SAM or BAM file. Required.
OUTPUT (File)	File to write the output to. Required.
ASSUME_SORTED (Boolean)	If true (default), then the sort order in the header file will be ignored. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
STOP_AFTER (Long)	Stop after processing N reads, mainly for debugging. Default value: 0. This option can be set to 'null' to clear the default value.

CollectBaseDistributionByCycle

Chart the nucleotide distribution per cycle in a SAM or BAM file. This tool produces a chart of the nucleotide distribution per cycle in a SAM or BAM file in order to enable assessment of systematic errors at specific positions in the reads.

Interpretation notes

Increased numbers of miscalled bases will be reflected in base distribution changes and increases in the number of Ns. In general, we expect that for any given cycle, or position within reads, the relative proportions of A, T, C and G should reflect the AT:GC content of the organism's genome. Thus, for all four nucleotides, flattish lines would be expected. Deviations from this expectation, for example a spike of A at a particular cycle (position within reads), would suggest a systematic sequencing error.
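The notion of a per-cycle base distribution can be made concrete with a small awk sketch. It reads plain read sequences, one per line, and prints the fraction of A, C, G, T, and N at each position. This is a simplified illustration, not the tool's implementation: the function name base_dist_by_cycle and the input format are assumptions, and strand orientation, filtering, and paired reads are ignored.

```shell
# Simplified illustration: per-position base fractions from plain read sequences.
# Output columns (tab-separated): cycle, A, C, G, T, N, each as a fraction.
base_dist_by_cycle() {
    awk '{
        n = length($0)
        if (n > max) max = n
        for (i = 1; i <= n; i++) {
            b = toupper(substr($0, i, 1))
            count[i, b]++      # occurrences of base b at cycle i
            total[i]++         # reads that reach cycle i
        }
    }
    END {
        for (i = 1; i <= max; i++)
            printf "%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", i,
                count[i, "A"] / total[i], count[i, "C"] / total[i],
                count[i, "G"] / total[i], count[i, "T"] / total[i],
                count[i, "N"] / total[i]
    }'
}

printf 'ACGT\nACGA\n' | base_dist_by_cycle
```

With real data, a cycle whose row departs sharply from its neighbours is the kind of spike described above.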

Note on quality trimming

In the past, many sequencing data processing workflows included discarding the low-quality tails of reads by applying hard-clipping at some arbitrary base quality threshold value. This is no longer useful because most sophisticated analysis tools (such as the GATK variant discovery tools) are quality-aware, meaning that they are able to take base quality into account when weighing evidence provided by sequencing reads. Unnecessary clipping may interfere with other quality control evaluations and may lower the quality of analysis results. For example, trimming reduces the effectiveness of the Base Recalibration (BQSR) pre-processing step of the GATK Best Practices for Variant Discovery, which aims to correct some types of systematic biases that affect the accuracy of base quality scores.

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage example:

java -jar picard.jar CollectBaseDistributionByCycle \
CHART=collect_base_dist_by_cycle.pdf \
I=input.bam \
O=output.txt

Option	Description
CHART_OUTPUT (File)	A file (with .pdf extension) to write the chart to. Required.
ALIGNED_READS_ONLY (Boolean)	If set to true, calculate the base distribution over aligned reads only. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
PF_READS_ONLY (Boolean)	If set to true, calculate the base distribution over PF reads only (Illumina specific). PF reads are reads that passed the internal quality filters applied by Illumina sequencers. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INPUT (File)	Input SAM or BAM file. Required.
OUTPUT (File)	File to write the output to. Required.
ASSUME_SORTED (Boolean)	If true (default), then the sort order in the header file will be ignored. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
STOP_AFTER (Long)	Stop after processing N reads, mainly for debugging. Default value: 0. This option can be set to 'null' to clear the default value.

CollectGcBiasMetrics

Collect metrics regarding GC bias. This tool collects information about the relative proportions of guanine (G) and cytosine (C) nucleotides in a sample. Regions of high and low G + C content have been shown to interfere with mapping/aligning, ultimately leading to fragmented genome assemblies and poor coverage in a phenomenon known as 'GC bias'. Detailed information on the effects of GC bias on the collection and analysis of sequencing data can be found at DOI: 10.1371/journal.pone.0062856.

The GC bias statistics are always output in a detailed long-form version, but a summary can also be produced. Both the detailed metrics and the summary metrics are output as tables ('.txt' files), together with an accompanying chart that plots the data ('.pdf' file).

Detailed metrics

The table of detailed metrics includes GC percentages for each bin (GC), the percentage of WINDOWS corresponding to each GC bin of the reference sequence, the numbers of reads that start within a particular %GC content bin (READ_STARTS), and the mean base quality of the reads that correspond to a specific GC content distribution window (MEAN_BASE_QUALITY). NORMALIZED_COVERAGE is a relative measure of sequence coverage by the reads at a particular GC content. For each run, the corresponding reference sequence is divided into bins or windows based on the percentage of G + C content ranging from 0 - 100%. The percentages of G + C are determined from a defined length of sequence; the default value is set at 100 bases. The mean of the distribution will vary among organisms; human DNA has a mean GC content of 40%, suggesting a slight preponderance of AT-rich regions.
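As a small worked example of the G + C percentage of a window, the awk sketch below counts G and C bases in each input sequence and reports the percentage. The function name gc_percent is hypothetical, the 10-base input is shortened from the default 100-base window for readability, and ambiguous bases such as N are simply counted as non-GC, which may differ from how the tool treats them.

```shell
# Hypothetical helper: prints the G + C percentage of each sequence on stdin.
gc_percent() {
    awk '{
        seq = toupper($0)
        n = length(seq)
        gc = gsub(/[GC]/, "", seq)   # gsub returns the number of G/C bases removed
        printf "%.1f\n", (n > 0 ? 100 * gc / n : 0)
    }'
}

# Four of the ten bases are G or C.
printf 'GGCCAATTAA\n' | gc_percent
```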

Summary metrics

The table of summary metrics captures run-specific bias information including WINDOW_SIZE, ALIGNED_READS, TOTAL_CLUSTERS, AT_DROPOUT, and GC_DROPOUT. While WINDOW_SIZE refers to the numbers of bases used for the distribution (see above), the ALIGNED_READS and TOTAL_CLUSTERS are the total number of aligned reads and the total number of reads (after filtering) produced in a run. In addition, the tool produces both AT_DROPOUT and GC_DROPOUT metrics, which indicate the percentage of misaligned reads that correlate with low (%-GC is < 50%) or high (%-GC is > 50%) GC content respectively.

The percentage of 'coverage' or depth in a GC bin is calculated by dividing the number of reads of a particular GC content by the mean number of reads of all GC bins. A number of 1 represents mean coverage, a number less than 1 represents lower than mean coverage (e.g. 0.5 means half as much coverage as average) while a number greater than 1 represents higher than mean coverage (e.g. 3.1 means this GC bin has 3.1 times more reads per window than average). This tool also tracks mean base-quality scores of the reads within each GC content bin, enabling the user to determine how base quality scores vary with GC content.
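The calculation described above can be sketched in a few lines of awk. The sketch follows the simplified description in this section (reads in a bin divided by the mean reads over all bins) and assumes every bin contains the same number of windows; the function name normalized_coverage and the two-column input format are hypothetical.

```shell
# Reads "GC_bin<TAB>read_count" lines on stdin and prints
# "GC_bin<TAB>normalized_coverage": the bin's read count divided by the mean
# read count over all bins. Assumes equal numbers of windows per bin.
normalized_coverage() {
    awk 'BEGIN { FS = OFS = "\t" }
         { gc[NR] = $1; reads[NR] = $2; sum += $2 }
         END {
             if (NR == 0) exit
             mean = sum / NR
             for (i = 1; i <= NR; i++) printf "%s\t%.2f\n", gc[i], reads[i] / mean
         }'
}

# Mean is 100 reads per bin, so the bins sit at half, equal to, and 1.5x the mean.
printf '30\t50\n40\t100\n50\t150\n' | normalized_coverage
```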

The chart output associated with this data table plots the NORMALIZED_COVERAGE, the distribution of WINDOWs corresponding to GC percentages, and base qualities corresponding to each %GC bin.

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage Example:

java -jar picard.jar CollectGcBiasMetrics \
I=input.bam \
O=gc_bias_metrics.txt \
CHART=gc_bias_metrics.pdf \
S=summary_metrics.txt \
R=reference_sequence.fasta

Please see the GcBiasMetrics documentation for further explanations of each metric.

Option	Description
CHART_OUTPUT (File)	The PDF file to render the chart to. Required.
SUMMARY_OUTPUT (File)	The text file to write summary metrics to. Required.
SCAN_WINDOW_SIZE (Integer)	The size of the scanning windows on the reference genome that are used to bin reads. Default value: 100. This option can be set to 'null' to clear the default value.
MINIMUM_GENOME_FRACTION (Double)	For summary metrics, exclude GC windows that include less than this fraction of the genome. Default value: 1.0E-5. This option can be set to 'null' to clear the default value.
IS_BISULFITE_SEQUENCED (Boolean)	Whether the SAM or BAM file consists of bisulfite sequenced reads. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
METRIC_ACCUMULATION_LEVEL (MetricAccumulationLevel)	The level(s) at which to accumulate metrics. Default value: [ALL_READS]. This option can be set to 'null' to clear the default value. Possible values: {ALL_READS, SAMPLE, LIBRARY, READ_GROUP} This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
ALSO_IGNORE_DUPLICATES (Boolean)	Use to get additional results without duplicates. This option produces two plots per level at the same time: the usual one and one that excludes duplicates. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INPUT (File)	Input SAM or BAM file. Required.
OUTPUT (File)	File to write the output to. Required.
ASSUME_SORTED (Boolean)	If true (default), then the sort order in the header file will be ignored. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
STOP_AFTER (Long)	Stop after processing N reads, mainly for debugging. Default value: 0. This option can be set to 'null' to clear the default value.

CollectHiSeqXPfFailMetrics

Classify PF-Failing reads in a HiSeqX Illumina Basecalling directory into various categories.

This tool categorizes the reads that did not pass filter (PF-Failing) into four groups. These groups are based on a heuristic that was derived by looking at a few titration experiments.

After examining the called bases from the first 24 cycles of each read, the PF-Failed reads are grouped into the following four categories:

  • MISALIGNED - The first 24 basecalls of a read are uncalled (numNs~24). These reads appear to be flow cell artifacts because they were found only near tile boundaries and were independent of library concentration.
  • EMPTY - All 24 bases are called (numNs~0) but the number of bases with quality scores greater than two is less than or equal to eight (numQGtTwo<=8). These reads were independent of their location within the tiles and were inversely proportional to the library concentration.
  • POLYCLONAL - All 24 bases are called (numNs~0) and numQGtTwo>=12. These reads were independent of their location within the tiles and were directly proportional to the library concentration. They are likely the result of PCR artifacts.
  • UNKNOWN - The remaining reads that are PF-Failing but did not fit into any of the groups listed above.
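
The four categories above can be sketched as a small decision function. This is only an illustration of the heuristic as described here, not the tool's actual implementation: the function name is hypothetical, and the approximate conditions "numNs~24" and "numNs~0" are assumed to mean exactly 24 and exactly 0.

```python
from typing import List


def classify_pf_fail(bases: str, quals: List[int], n_cycles: int = 24) -> str:
    """Classify one PF-failing read from its first n_cycles basecalls.

    Assumption: "numNs~24" and "numNs~0" are treated as exact values.
    """
    bases = bases[:n_cycles]
    quals = quals[:n_cycles]
    num_ns = sum(1 for b in bases if b == "N")        # uncalled bases
    num_q_gt_two = sum(1 for q in quals if q > 2)     # bases with quality > 2

    if num_ns == n_cycles:
        return "MISALIGNED"
    if num_ns == 0 and num_q_gt_two <= 8:
        return "EMPTY"
    if num_ns == 0 and num_q_gt_two >= 12:
        return "POLYCLONAL"
    return "UNKNOWN"
```

Note that a fully called read with 9 to 11 bases above quality two matches neither EMPTY nor POLYCLONAL and therefore falls into UNKNOWN under this reading.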

The tool defaults to the SUMMARY output which indicates the number of PF-Failed reads per tile and groups them into the categories described above accordingly.

A DETAILED metrics option is also available that subdivides the SUMMARY outputs by the x-y position of these reads within each tile. To obtain the DETAILED metric table, add the PROB_EXPLICIT_READS option to your command line and set its value between 0 and 1. This value is the fraction of PF-Failed reads for which detailed output is written. For example, if PROB_EXPLICIT_READS=0, no detailed metrics are output. If PROB_EXPLICIT_READS=1, detailed metrics are provided for all (100%) of the PF-Failed reads. It follows that setting PROB_EXPLICIT_READS=0.5 provides detailed metrics for roughly half of the PF-Failed reads.
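
One way to picture PROB_EXPLICIT_READS is as an independent coin flip per PF-Failed read. The sketch below assumes that interpretation; the function name and the seed parameter are hypothetical and are not Picard options.

```python
import random


def select_for_detail(reads, prob_explicit_reads, seed=None):
    """Keep each read with probability prob_explicit_reads (0 = none, 1 = all)."""
    if not 0.0 <= prob_explicit_reads <= 1.0:
        raise ValueError("prob_explicit_reads must be between 0 and 1")
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 0 keeps nothing and 1 keeps everything.
    return [r for r in reads if rng.random() < prob_explicit_reads]
```

With an intermediate value such as 0.5 the selected share is only approximately one half, because each read is sampled independently.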

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage example: (SUMMARY Metrics)

java -jar picard.jar CollectHiSeqXPfFailMetrics \
BASECALLS_DIR=/BaseCalls/ \
OUTPUT=/metrics/ \
LANE=001

Usage example: (DETAILED Metrics)

java -jar picard.jar CollectHiSeqXPfFailMetrics \
BASECALLS_DIR=/BaseCalls/ \
OUTPUT=/Detail_metrics/ \
LANE=001 \
PROB_EXPLICIT_READS=1
Please see our documentation on the SUMMARY and DETAILED metrics for comprehensive explanations of the outputs produced by this tool.

Option	Description
BASECALLS_DIR (File)	The Illumina basecalls directory. Required.
OUTPUT (File)	Basename for metrics file. Resulting file will be .pffail_summary_metrics. Required.
PROB_EXPLICIT_READS (Double)	The fraction of (non-PF) reads for which to output explicit classification. Output file will be .pffail_detailed_metrics (if PROB_EXPLICIT_READS != 0). Default value: 0.0. This option can be set to 'null' to clear the default value.
LANE (Integer)	Lane number. Required.
NUM_PROCESSORS (Integer)	Run this many PerTileBarcodeExtractors in parallel. If NUM_PROCESSORS = 0, number of cores is automatically set to the number of cores available on the machine. If NUM_PROCESSORS < 0 then the number of cores used will be the number available on the machine less NUM_PROCESSORS. Default value: 1. This option can be set to 'null' to clear the default value.
N_CYCLES (Integer)	Number of cycles to look at. PF status is currently determined at cycle 24, so values greater than this will yield strange results, and running with any value other than 24 is neither tested nor recommended. Default value: 24. This option can be set to 'null' to clear the default value.

CollectHsMetrics

Collects hybrid-selection (HS) metrics for a SAM or BAM file. This tool takes a SAM/BAM file input and collects metrics that are specific for sequence datasets generated through hybrid-selection. Hybrid-selection (HS) is the most commonly used technique to capture exon-specific sequences for targeted sequencing experiments such as exome sequencing; for more information, please see the corresponding GATK Dictionary entry.

This tool requires an aligned SAM or BAM file as well as bait and target interval files in Picard interval_list format. You should use the bait and interval files that correspond to the capture kit that was used to generate the capture libraries for sequencing, which can generally be obtained from the kit manufacturer. If the baits and target intervals are provided in BED format, you can convert them to the Picard interval_list format using Picard's BedToIntervalList tool.

If a reference sequence is provided, this program will calculate both AT_DROPOUT and GC_DROPOUT metrics. Dropout metrics are an attempt to measure the reduced representation of reads in regions that deviate from 50% G/C content. This reduction in the number of aligned reads is due to the increased number of errors associated with sequencing regions with excessive or deficient numbers of G/C bases, ultimately leading to poor mapping efficiencies and low coverage in the affected regions.

If you are interested in getting G/C content and mean sequence depth information for every target interval, use the PER_TARGET_COVERAGE option.

Note: Metrics labeled as percentages are actually expressed as fractions!
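
Because such values are fractions, a downstream script has to multiply by 100 before reporting them as percentages. The following is a minimal sketch that assumes the usual layout of a Picard metrics file (comment lines starting with '#', then a tab-separated header row and data rows, ended by a blank line). The helper names and the sample content are made up for illustration.

```python
def read_metrics_rows(text):
    """Return the metrics table as a list of dicts (all values kept as strings)."""
    rows = []
    header = None
    for line in text.splitlines():
        if line.startswith("#"):
            continue                      # header/comment lines
        if not line.strip():
            if header is not None:
                break                     # blank line ends the metrics table
            continue
        fields = line.split("\t")
        if header is None:
            header = fields
        else:
            rows.append(dict(zip(header, fields)))
    return rows


def as_percent(fraction_text):
    """Convert a fraction such as '0.8125' into a percentage value."""
    return float(fraction_text) * 100.0


# Made-up sample content, not real tool output.
SAMPLE = (
    "## METRICS CLASS\tpicard.analysis.directed.HsMetrics\n"
    "BAIT_SET\tPCT_SELECTED_BASES\n"
    "example_baits\t0.8125\n"
    "\n"
)
rows = read_metrics_rows(SAMPLE)
print(as_percent(rows[0]["PCT_SELECTED_BASES"]))  # prints 81.25
```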

Usage Example:

java -jar picard.jar CollectHsMetrics \
I=input.bam \
O=hs_metrics.txt \
R=reference_sequence.fasta \
BAIT_INTERVALS=bait.interval_list \
TARGET_INTERVALS=target.interval_list

Please see CollectHsMetrics for detailed descriptions of the output metrics produced by this tool.


Option	Description
BAIT_INTERVALS (File)	An interval list file that contains the locations of the baits used. Default value: null. This option must be specified at least once.
BAIT_SET_NAME (String)	Bait set name. If not provided it is inferred from the filename of the bait intervals. Default value: null.
MINIMUM_MAPPING_QUALITY (Integer)	Minimum mapping quality for a read to contribute coverage. Default value: 20. This option can be set to 'null' to clear the default value.
MINIMUM_BASE_QUALITY (Integer)	Minimum base quality for a base to contribute coverage. Default value: 20. This option can be set to 'null' to clear the default value.
CLIP_OVERLAPPING_READS (Boolean)	True if we are to clip overlapping reads, false otherwise. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
TARGET_INTERVALS (File)	An interval list file that contains the locations of the targets. Default value: null. This option must be specified at least once.
INPUT (File)	An aligned SAM or BAM file. Required.
OUTPUT (File)	The output file to write the metrics to. Required.
METRIC_ACCUMULATION_LEVEL (MetricAccumulationLevel)	The level(s) at which to accumulate metrics. Default value: [ALL_READS]. Possible values: {ALL_READS, SAMPLE, LIBRARY, READ_GROUP} This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
PER_TARGET_COVERAGE (File)	An optional file to output per target coverage information to. Default value: null.
PER_BASE_COVERAGE (File)	An optional file to output per base coverage information to. The per-base file contains one line per target base and can grow very large. It is not recommended for use with large target sets. Default value: null.
NEAR_DISTANCE (Integer)	The maximum distance between a read and the nearest probe/bait/amplicon for the read to be considered 'near probe' and included in percent selected. Default value: 250. This option can be set to 'null' to clear the default value.
COVERAGE_CAP (Integer)	Parameter to set a max coverage limit for Theoretical Sensitivity calculations. Default value: 200. This option can be set to 'null' to clear the default value.
SAMPLE_SIZE (Integer)	Sample Size used for Theoretical Het Sensitivity sampling. Default value: 10000. This option can be set to 'null' to clear the default value.

CollectIlluminaBasecallingMetrics

Collects Illumina Basecalling metrics for a sequencing run.

This tool produces per-barcode and per-lane basecall metrics for each sequencing run. Mean values for each metric are determined using data from all of the tiles. The tool requires the following inputs: LANE (#), BASECALLS_DIR, READ_STRUCTURE, and an input file listing the sample barcodes. The metrics provided include the total numbers of bases, reads, and clusters, as well as the fractions of bases, reads, and clusters that passed Illumina quality filters (PF), both per barcode and per lane. For additional information on Illumina's PF quality metric, please see the corresponding GATK Dictionary entry.

The input barcode_list.txt file is a file containing all of the sample and molecular barcodes and can be obtained from the ExtractIlluminaBarcodes tool.

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage example:

java -jar picard.jar CollectIlluminaBasecallingMetrics \
BASECALLS_DIR=/BaseCalls/ \
LANE=001 \
READ_STRUCTURE=25T8B25T \
INPUT=barcode_list.txt

Please see the CollectIlluminaBasecallingMetrics definitions for a complete description of the metrics produced by this tool.


Option	Description
BASECALLS_DIR (File)	The Illumina basecalls output directory from which data are read. Required.
BARCODES_DIR (File)	The barcodes directory with _barcode.txt files (generated by ExtractIlluminaBarcodes). If not set, use BASECALLS_DIR. Default value: null.
LANE (Integer)	The lane whose data will be read. Required.
INPUT (File)	The file containing barcodes to expect from the run - barcodeData.# Default value: null.
READ_STRUCTURE (String)	A description of the logical structure of clusters in an Illumina Run, i.e. a description of the structure IlluminaBasecallsToSam assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for Sample Barcode, M for molecular barcode, T for Template, and S for skip). E.g. if the input data consists of 80 base clusters and we provide a read structure of "28T8M8B8S28T" then the sequence may be split up into four reads: read one with 28 cycles (bases) of template; read two with 8 cycles (bases) of molecular barcode (e.g. unique molecular barcode); read three with 8 cycles (bases) of sample barcode; 8 cycles (bases) skipped; read four with 28 cycles (bases) of template. The skipped cycles would NOT be included in an output SAM/BAM file or in read groups therein. Required.
OUTPUT (File)	The file to which the collected metrics are written. Default value: null.
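
The READ_STRUCTURE notation described above (integer/character pairs using T, B, M, and S) can be decoded with a few lines of code. The sketch below is an independent illustration with a hypothetical function name, not the parser Picard itself uses.

```python
import re
from typing import List, Tuple

# One segment is a cycle count followed by a type letter (T, B, M, or S).
_SEGMENT = re.compile(r"(\d+)([TBMS])")


def parse_read_structure(read_structure: str) -> List[Tuple[int, str]]:
    """Split a string such as '28T8M8B8S28T' into (cycles, type) pairs."""
    segments = []
    pos = 0
    while pos < len(read_structure):
        match = _SEGMENT.match(read_structure, pos)
        if match is None:
            raise ValueError(
                "Cannot parse read structure %r at offset %d" % (read_structure, pos)
            )
        segments.append((int(match.group(1)), match.group(2)))
        pos = match.end()
    if not segments:
        raise ValueError("Read structure is empty")
    return segments


segments = parse_read_structure("28T8M8B8S28T")
total_cycles = sum(cycles for cycles, _ in segments)        # 80 cycles in total
kept = [(c, kind) for c, kind in segments if kind != "S"]   # skipped cycles dropped
```

For the 80-base example from the table, the five segments add up to 80 cycles, and dropping the S segment leaves the four reads listed there.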

CollectIlluminaLaneMetrics

Collects Illumina lane metrics for the given BaseCalling analysis directory. This tool produces quality control metrics on cluster density for each lane of an Illumina flowcell. This tool takes Illumina TileMetrics data and places them into directories containing lane- and phasing-level metrics. In this context, phasing refers to the fraction of molecules that fall behind or jump ahead (prephasing) during a read cycle.

Usage example:

java -jar picard.jar CollectIlluminaLaneMetrics \
RUN_DIRECTORY=test_run \
OUTPUT_DIRECTORY=Lane_output_metrics \
OUTPUT_PREFIX=experiment1 \
READ_STRUCTURE=25T8B25T

Please see the CollectIlluminaLaneMetrics definitions for a complete description of the metrics produced by this tool.


Option	Description
RUN_DIRECTORY (File)	The Illumina run directory of the run for which the lane metrics are to be generated. Required.
OUTPUT_DIRECTORY (File)	The directory to which the output file will be written. Required.
OUTPUT_PREFIX (String)	The prefix to be prepended to the file name of the output file; an appropriate suffix will be applied. Required.
READ_STRUCTURE (ReadStructure)	A description of the logical structure of clusters in an Illumina Run, i.e. a description of the structure IlluminaBasecallsToSam assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for Sample Barcode, M for molecular barcode, T for Template, and S for skip). E.g. if the input data consists of 80 base clusters and we provide a read structure of "28T8M8B8S28T" then the sequence may be split up into four reads: read one with 28 cycles (bases) of template; read two with 8 cycles (bases) of molecular barcode (e.g. unique molecular barcode); read three with 8 cycles (bases) of sample barcode; 8 cycles (bases) skipped; read four with 28 cycles (bases) of template. The skipped cycles would NOT be included in an output SAM/BAM file or in read groups therein. If not given, will use the RunInfo.xml in the run directory. Default value: null.
FILE_EXTENSION (String)	Append the given file extension to all metric file names (ex. OUTPUT.illumina_lane_metrics.EXT). None if null. Default value: null.
IS_NOVASEQ (Boolean)	Boolean that determines if this run is a NovaSeq run or not. (NovaSeq tile metrics files are in the cycle 25 directory.) Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

CollectInsertSizeMetrics

This tool provides useful metrics for validating library construction including the insert size distribution and read orientation of paired-end libraries.

The expected proportions of these metrics vary depending on the type of library preparation used, resulting from technical differences between paired-end libraries and mate-pair libraries. For a brief primer on paired-end sequencing and mate-pair reads, see the GATK Dictionary.

The CollectInsertSizeMetrics tool outputs the percentages of read pairs in each of the three orientations (FR, RF, and TANDEM) as a histogram. In addition, the insert size distribution is output as both a histogram (.insert_size_histogram.pdf) and as a data table (.insert_size_metrics.txt).

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage example:

java -jar picard.jar CollectInsertSizeMetrics \
I=input.bam \
O=insert_size_metrics.txt \
H=insert_size_histogram.pdf \
M=0.5
Note: If processing a small file, set the minimum percentage option (M) to 0.5, otherwise an error may occur.

Please see InsertSizeMetrics for detailed explanations of each metric.
Collect metrics about the insert size distribution of a paired-end library.

Option	Description
HISTOGRAM_FILE (File)	File to write insert size histogram chart to. Required.
DEVIATIONS (Double)	Generate mean, sd and plots by trimming the data down to MEDIAN + DEVIATIONS*MEDIAN_ABSOLUTE_DEVIATION. This is done because insert size data typically includes enough anomalous values from chimeras and other artifacts to make the mean and sd grossly misleading regarding the real distribution. Default value: 10.0. This option can be set to 'null' to clear the default value.
HISTOGRAM_WIDTH (Integer)	Explicitly sets the histogram width, overriding automatic truncation of the histogram tail. Also, when calculating mean and standard deviation, only bins <= HISTOGRAM_WIDTH will be included. Default value: null.
MINIMUM_PCT (Float)	When generating the histogram, discard any data categories (out of FR, TANDEM, RF) that have fewer than this percentage of overall reads. (Range: 0 to 1). Default value: 0.05. This option can be set to 'null' to clear the default value.
METRIC_ACCUMULATION_LEVEL (MetricAccumulationLevel)	The level(s) at which to accumulate metrics. Default value: [ALL_READS]. Possible values: {ALL_READS, SAMPLE, LIBRARY, READ_GROUP} This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
INCLUDE_DUPLICATES (Boolean)	If true, also include reads marked as duplicates in the insert size histogram. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INPUT (File)	Input SAM or BAM file. Required.
OUTPUT (File)	File to write the output to. Required.
ASSUME_SORTED (Boolean)	If true (default), then the sort order in the header file will be ignored. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
STOP_AFTER (Long)	Stop after processing N reads, mainly for debugging. Default value: 0. This option can be set to 'null' to clear the default value.
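
The DEVIATIONS option trims the data to MEDIAN + DEVIATIONS*MEDIAN_ABSOLUTE_DEVIATION before the mean and standard deviation are computed. The sketch below illustrates that idea on a plain list of insert sizes. The function name is hypothetical, and the use of the sample standard deviation is an assumption rather than a statement about Picard's internals.

```python
from statistics import mean, median, stdev


def trimmed_insert_stats(insert_sizes, deviations=10.0):
    """Compute mean/sd after dropping values above median + deviations * MAD."""
    med = median(insert_sizes)
    # Median absolute deviation: median of |x - median|.
    mad = median(abs(x - med) for x in insert_sizes)
    cutoff = med + deviations * mad
    kept = [x for x in insert_sizes if x <= cutoff]
    return {
        "median": med,
        "mad": mad,
        "cutoff": cutoff,
        "mean": mean(kept),
        "sd": stdev(kept) if len(kept) > 1 else 0.0,
    }


# Toy data: five plausible inserts plus one chimera-like outlier.
stats = trimmed_insert_stats([100, 200, 200, 200, 300, 10000])
```

In this toy data set the median is 200 and the MAD is 50, so the cutoff is 700 and the 10000 outlier no longer distorts the mean, which comes out at 200 instead of roughly 1833.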

CollectJumpingLibraryMetrics

Collect jumping library metrics.

This tool collects high-level metrics about the presence of outward-facing (jumping) and inward-facing (non-jumping) read pairs within a SAM or BAM file. For a brief primer on jumping libraries, see the GATK Dictionary.

This program gets all data for computation from the first read in each pair in which the mapping quality (MQ) tag is set with the mate's mapping quality. If the MQ tag is not set, then the program assumes that the mate's MQ is greater than or equal to MINIMUM_MAPPING_QUALITY (default value is 0).

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage example:

java -jar picard.jar CollectJumpingLibraryMetrics \
I=input.bam \
O=jumping_metrics.txt
Please see the output metrics documentation on JumpingLibraryMetrics for detailed explanations of the output metrics.

Option	Description
INPUT (File)	BAM file(s) of reads with duplicates marked. Default value: null. This option may be specified 0 or more times.
OUTPUT (File)	File to which metrics should be written. Required.
MINIMUM_MAPPING_QUALITY (Integer)	Mapping quality minimum cutoff. Default value: 0. This option can be set to 'null' to clear the default value.
TAIL_LIMIT (Integer)	When calculating mean and stdev, stop when the bins in the tail of the distribution contain fewer than mode/TAIL_LIMIT items. Default value: 10000. This option can be set to 'null' to clear the default value.
CHIMERA_KB_MIN (Integer)	Jumps greater than or equal to the greater of this value or 2 times the mode of the outward-facing pairs are considered chimeras. Default value: 100000. This option can be set to 'null' to clear the default value.
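
Two of the rules described for this tool are easy to state in code: the fallback when the mate's MQ tag is missing, and the chimera cutoff derived from CHIMERA_KB_MIN. The sketch below assumes the read's optional tags are available as a plain dictionary; the function names are hypothetical and this is not the tool's own code.

```python
def mate_passes_mapq(tags, minimum_mapping_quality=0):
    """Check the mate's mapping quality using the MQ tag of the first read."""
    mate_mq = tags.get("MQ")
    if mate_mq is None:
        # No MQ tag: the mate is assumed to meet the threshold.
        return True
    return mate_mq >= minimum_mapping_quality


def chimera_cutoff(outward_mode, chimera_kb_min=100000):
    """Cutoff is the greater of CHIMERA_KB_MIN and twice the outward-facing mode."""
    return max(chimera_kb_min, 2 * outward_mode)


def is_chimera(jump_size, outward_mode, chimera_kb_min=100000):
    """Jumps at or above the cutoff are counted as chimeras."""
    return jump_size >= chimera_cutoff(outward_mode, chimera_kb_min)
```

With the default CHIMERA_KB_MIN of 100000, the mode only raises the cutoff once it exceeds 50000.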

CollectMultipleMetrics

Collect multiple classes of metrics. This 'meta-metrics' tool runs one or more of the metrics collection modules at the same time to cut down on the time spent reading in data from input files. Available modules include CollectAlignmentSummaryMetrics, CollectInsertSizeMetrics, QualityScoreDistribution, MeanQualityByCycle, CollectBaseDistributionByCycle, CollectGcBiasMetrics, RnaSeqMetrics, CollectSequencingArtifactMetrics, and CollectQualityYieldMetrics. The tool produces outputs of '.pdf' and '.txt' files for each module, except for the CollectAlignmentSummaryMetrics module, which outputs only a '.txt' file. Output files are named by specifying a base name (without any file extensions).

Currently all programs are run with default options and fixed output extensions, but this may become more flexible in future. Specifying a reference sequence file is required.

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage example (all modules on by default):

java -jar picard.jar CollectMultipleMetrics \
I=input.bam \
O=multiple_metrics \
R=reference_sequence.fasta

Usage example (two modules only):

java -jar picard.jar CollectMultipleMetrics \
I=input.bam \
O=multiple_metrics \
R=reference_sequence.fasta \
PROGRAM=null \
PROGRAM=QualityScoreDistribution \
PROGRAM=MeanQualityByCycle
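
In the second example, PROGRAM=null clears the default module list before the two desired modules are added. The sketch below models that behaviour for a generic list-valued option. It assumes that values given without a preceding null are added to the defaults, and the function name is hypothetical rather than part of Picard.

```python
DEFAULT_PROGRAMS = [
    "CollectAlignmentSummaryMetrics",
    "CollectBaseDistributionByCycle",
    "CollectInsertSizeMetrics",
    "MeanQualityByCycle",
    "QualityScoreDistribution",
]


def resolve_list_option(defaults, supplied):
    """Apply command-line values to a list option in the order given."""
    values = list(defaults)
    for value in supplied:
        if value == "null":
            values = []              # 'null' clears everything collected so far
        elif value not in values:
            values.append(value)     # assumption: duplicates are ignored
    return values


selected = resolve_list_option(
    DEFAULT_PROGRAMS,
    ["null", "QualityScoreDistribution", "MeanQualityByCycle"],
)
```

Under this model, leaving out PROGRAM=null would keep all five default modules in the run.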

Option	Description
INPUT (File)	Input SAM or BAM file. Required.
ASSUME_SORTED (Boolean)	If true (default), then the sort order in the header file will be ignored. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
STOP_AFTER (Integer)	Stop after processing N reads, mainly for debugging. Default value: 0. This option can be set to 'null' to clear the default value.
OUTPUT (String)	Base name of output files. Required.
METRIC_ACCUMULATION_LEVEL (MetricAccumulationLevel)	The level(s) at which to accumulate metrics. Default value: [ALL_READS]. Possible values: {ALL_READS, SAMPLE, LIBRARY, READ_GROUP} This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
FILE_EXTENSION (String)	Append the given file extension to all metric file names (ex. OUTPUT.insert_size_metrics.EXT). None if null. Default value: null.
PROGRAM (Program)	Set of metrics programs to apply during the pass through the SAM file. Default value: [CollectAlignmentSummaryMetrics, CollectBaseDistributionByCycle, CollectInsertSizeMetrics, MeanQualityByCycle, QualityScoreDistribution]. Possible values: {CollectAlignmentSummaryMetrics, CollectInsertSizeMetrics, QualityScoreDistribution, MeanQualityByCycle, CollectBaseDistributionByCycle, CollectGcBiasMetrics, RnaSeqMetrics, CollectSequencingArtifactMetrics, CollectQualityYieldMetrics} This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
INTERVALS (File)	An optional list of intervals to restrict analysis to. Only pertains to some of the PROGRAMs. Programs whose stand-alone CLP does not have an INTERVALS argument will silently ignore this argument. Default value: null.
DB_SNP (File)	VCF format dbSNP file, used to exclude regions around known polymorphisms from analysis by some PROGRAMs; PROGRAMs whose CLP doesn't allow for this argument will quietly ignore it. Default value: null.
INCLUDE_UNPAIRED (Boolean)	Include unpaired reads in CollectSequencingArtifactMetrics. If set to true then all paired reads will be included as well - MINIMUM_INSERT_SIZE and MAXIMUM_INSERT_SIZE will be ignored in CollectSequencingArtifactMetrics. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

CollectOxoGMetrics

Collect metrics to assess oxidative artifacts. This tool collects metrics quantifying the error rate resulting from oxidative artifacts. For a brief primer on oxidative artifacts, see the GATK Dictionary.

This tool calculates the Phred-scaled probability that an alternate base call results from an oxidation artifact. This probability score is based on base context, sequencing read orientation, and the characteristic low allelic frequency. Please see the following reference for an in-depth discussion of the OxoG error rate.

Lower probability values implicate artifacts resulting from 8-oxoguanine, while higher probability values suggest that an alternate base call is due to either some other type of artifact or is a real variant.
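
For readers unfamiliar with Phred scaling, the standard transformation is Q = -10 * log10(p). The sketch below shows only that conversion and its inverse; how the tool estimates the underlying probability is not covered here, and the function names are hypothetical.

```python
import math


def phred_scale(probability):
    """Convert a probability in (0, 1] to a Phred-scaled score."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10.0 * math.log10(probability)


def phred_to_probability(score):
    """Convert a Phred-scaled score back to a probability."""
    return 10.0 ** (-score / 10.0)
```

On this scale an error rate of 1 in 1000 corresponds to a score of 30, and each additional 10 points reflects a tenfold smaller probability.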

Usage example:

java -jar picard.jar CollectOxoGMetrics \
I=input.bam \
O=oxoG_metrics.txt \
R=reference_sequence.fasta

Option	Description
INPUT (File)	Input BAM file for analysis. Required.
OUTPUT (File)	Location of output metrics file to write. Required.
REFERENCE_SEQUENCE (File)	Reference sequence to which BAM is aligned. Required.
INTERVALS (File)	An optional list of intervals to restrict analysis to. Default value: null.
DB_SNP (File)	VCF format dbSNP file, used to exclude regions around known polymorphisms from analysis. Default value: null.
MINIMUM_QUALITY_SCORE (Integer)	The minimum base quality score for a base to be included in analysis. Default value: 20. This option can be set to 'null' to clear the default value.
MINIMUM_MAPPING_QUALITY (Integer)	The minimum mapping quality score for a base to be included in analysis. Default value: 30. This option can be set to 'null' to clear the default value.
MINIMUM_INSERT_SIZE (Integer)	The minimum insert size for a read to be included in analysis. Set to 0 to allow unpaired reads. Default value: 60. This option can be set to 'null' to clear the default value.
MAXIMUM_INSERT_SIZE (Integer)	The maximum insert size for a read to be included in analysis. Set to 0 to allow unpaired reads. Default value: 600. This option can be set to 'null' to clear the default value.
INCLUDE_NON_PF_READS (Boolean)	Whether or not to include non-PF reads. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
USE_OQ (Boolean)	When available, use original quality scores for filtering. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
CONTEXT_SIZE (Integer)	The number of context bases to include on each side of the assayed G/C base. Default value: 1. This option can be set to 'null' to clear the default value.
CONTEXTS (String)	The optional set of sequence contexts to restrict analysis to. If not supplied all contexts are analyzed. Default value: null. This option may be specified 0 or more times.
STOP_AFTER (Integer)	For debugging purposes: stop after visiting this many sites with at least 1X coverage. Default value: 2147483647. This option can be set to 'null' to clear the default value.

CollectQualityYieldMetrics

Collect metrics about reads that pass quality thresholds and Illumina-specific filters. This tool evaluates the overall quality of reads within a BAM file containing one read group. The output indicates the total number of bases within a read group that pass a minimum base quality score threshold and (in the case of Illumina data) pass Illumina quality filters as described in the GATK Dictionary entry.

Note on base quality score options

If the quality score of read bases has been modified in a previous data processing step such as GATK Base Recalibration and an OQ tag is available, this tool can be set to use the OQ value instead of the primary quality value for the evaluation.

Note that the default behaviour of this program changed as of November 6th 2015 to no longer include secondary and supplemental alignments in the computation.

Usage Example:

java -jar picard.jar CollectQualityYieldMetrics \
I=input.bam \
O=quality_yield_metrics.txt
Please see the QualityYieldMetrics documentation for details and explanations of the output metrics.

Option	Description
USE_ORIGINAL_QUALITIES (Boolean)	If available in the OQ tag, use the original quality scores as inputs instead of the quality scores in the QUAL field. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INCLUDE_SECONDARY_ALIGNMENTS (Boolean)	If true, include bases from secondary alignments in metrics. Setting to true may cause double-counting of bases if there are secondary alignments in the input file. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INCLUDE_SUPPLEMENTAL_ALIGNMENTS (Boolean)	If true, include bases from supplemental alignments in metrics. Setting to true may cause double-counting of bases if there are supplemental alignments in the input file. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INPUT (File)	Input SAM or BAM file. Required.
OUTPUT (File)	File to write the output to. Required.
ASSUME_SORTED (Boolean)	If true (default), then the sort order in the header file will be ignored. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
STOP_AFTER (Long)	Stop after processing N reads, mainly for debugging. Default value: 0. This option can be set to 'null' to clear the default value.

CollectRawWgsMetrics

Collect whole genome sequencing-related metrics. This tool computes metrics that are useful for evaluating coverage and performance of whole genome sequencing experiments. These metrics include the percentages of reads that pass minimal base-quality and mapping-quality filters as well as coverage (read-depth) levels.

The histogram output is optional and for a given run, displays two separate outputs on the y-axis while using a single set of values for the x-axis. Specifically, the first column in the histogram table (x-axis) is labeled 'coverage' and represents different possible coverage depths. However, it also represents the range of values for the base quality scores and thus should probably be labeled 'sequence depth and base quality scores'. The second and third columns (y-axes) correspond to the numbers of bases at a specific sequence depth 'count' and the numbers of bases at a particular base quality score 'baseq_count' respectively.

Although similar to the CollectWgsMetrics tool, the default thresholds for CollectRawWgsMetrics are less stringent. For example, CollectRawWgsMetrics has base and mapping quality score thresholds set to '3' and '0' respectively, while the CollectWgsMetrics tool has the default threshold values set to '20' (at time of writing). Nevertheless, both tools enable the user to input specific threshold values.

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage example:

java -jar picard.jar CollectRawWgsMetrics \
I=input.bam \
O=raw_wgs_metrics.txt \
R=reference_sequence.fasta \
INCLUDE_BQ_HISTOGRAM=true

Please see the WgsMetrics documentation for detailed explanations of the output metrics.

Option	Description
MINIMUM_MAPPING_QUALITY (Integer)	Minimum mapping quality for a read to contribute coverage. Default value: 0. This option can be set to 'null' to clear the default value.
MINIMUM_BASE_QUALITY (Integer)	Minimum base quality for a base to contribute coverage. Default value: 3. This option can be set to 'null' to clear the default value.
COVERAGE_CAP (Integer)	Treat bases with coverage exceeding this value as if they had coverage at this value. Default value: 100000. This option can be set to 'null' to clear the default value.
LOCUS_ACCUMULATION_CAP (Integer)	At positions with coverage exceeding this value, completely ignore reads that accumulate beyond this value (so that they will not be considered for PCT_EXC_CAPPED). Used to keep memory consumption in check, but could create bias if set too low. Default value: 200000. This option can be set to 'null' to clear the default value.
INPUT (File)	Input SAM or BAM file. Required.
OUTPUT (File)	Output metrics file. Required.
REFERENCE_SEQUENCE (File)	The reference sequence fasta aligned to. Required.
STOP_AFTER (Long)	For debugging purposes, stop after processing this many genomic bases. Default value: -1. This option can be set to 'null' to clear the default value.
INCLUDE_BQ_HISTOGRAM (Boolean)	Determines whether to include the base quality histogram in the metrics file. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
COUNT_UNPAIRED (Boolean)	If true, count unpaired reads, and paired reads with one end unmapped. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
SAMPLE_SIZE (Integer)	Sample Size used for Theoretical Het Sensitivity sampling. Default value: 10000. This option can be set to 'null' to clear the default value.
USE_FAST_ALGORITHM (Boolean)	If true, fast algorithm is used. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
READ_LENGTH (Integer)	Average read length in the file. Default value: 150. This option can be set to 'null' to clear the default value.
INTERVALS (File)	An interval list file that contains the positions to restrict the assessment. Please note that all bases of reads that overlap these intervals will be considered, even if some of those bases extend beyond the boundaries of the interval. The ideal use case for this argument is to use it to restrict the calculation to a subset of (whole) contigs. Default value: null.
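
The effect of COVERAGE_CAP ("treat bases with coverage exceeding this value as if they had coverage at this value") can be pictured with a small depth histogram. The sketch below works on a plain list of per-base depths; the function names are hypothetical and it does not reproduce the tool's filtering steps.

```python
from collections import Counter


def capped_depth_histogram(depths, coverage_cap=100000):
    """Count bases per depth, folding depths above the cap into the cap bin."""
    return Counter(min(depth, coverage_cap) for depth in depths)


def mean_coverage(histogram):
    """Mean depth computed from a {depth: base_count} histogram."""
    total_bases = sum(histogram.values())
    if total_bases == 0:
        return 0.0
    return sum(depth * count for depth, count in histogram.items()) / total_bases


hist = capped_depth_histogram([0, 5, 5, 300], coverage_cap=200)
```

In this toy input the base at depth 300 is counted in the 200 bin, so a low cap pulls the mean coverage down relative to the uncapped data.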

CollectTargetedPcrMetrics

Calculate PCR-related metrics from targeted sequencing data.

This tool calculates a set of PCR-related metrics from an aligned SAM or BAM file containing targeted sequencing data. It is appropriate for data produced with multiple small-target technologies including exome sequencing and custom amplicon panels such as the Illumina TruSeq Custom Amplicon (TSCA) kit.

If a reference sequence is provided, AT/GC dropout metrics will be calculated, and the PER_TARGET_COVERAGE option can be used to output GC content and mean sequence depth information for every target interval. The AT/GC dropout metrics indicate the degree of inadequate coverage of a particular region based on its AT or GC content.
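As an illustration of the per-target GC content mentioned above, the following minimal Python sketch computes the GC fraction of a target sequence. The function name `gc_fraction` and the sample sequences are hypothetical, and skipping ambiguous bases such as N is an assumption made for this sketch, not a statement about how Picard treats them. The result is a fraction between 0 and 1, consistent with the way this tool reports percentage-labeled metrics.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of G/C bases among unambiguous (A/C/G/T) bases.

    Assumption for this sketch: ambiguous bases (e.g. N) are skipped.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Avoid division by zero for empty or all-N sequences.
    return gc / total if total else 0.0


# Hypothetical target sequences, keyed by a made-up target name.
targets = {"target_1": "ATGCGCGC", "target_2": "ATATATAT"}
for name, seq in targets.items():
    print(name, round(gc_fraction(seq), 3))
```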

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage Example

java -jar picard.jar CollectTargetedPcrMetrics \
I=input.bam \
O=pcr_metrics.txt \
R=reference_sequence.fasta \
AMPLICON_INTERVALS=amplicon.interval_list \
TARGET_INTERVALS=targets.interval_list
Please see the metrics definitions page on TargetedPcrMetrics for detailed explanations of the output metrics produced by this tool.

OptionDescription
AMPLICON_INTERVALS (File)An interval list file that contains the locations of the baits used. Required.
CUSTOM_AMPLICON_SET_NAME (String)Custom amplicon set name. If not provided it is inferred from the filename of the AMPLICON_INTERVALS intervals. Default value: null.
TARGET_INTERVALS (File)An interval list file that contains the locations of the targets. Default value: null. This option must be specified at least once.
INPUT (File)An aligned SAM or BAM file. Required.
OUTPUT (File)The output file to write the metrics to. Required.
METRIC_ACCUMULATION_LEVEL (MetricAccumulationLevel)The level(s) at which to accumulate metrics. Default value: [ALL_READS]. This option can be set to 'null' to clear the default value. Possible values: {ALL_READS, SAMPLE, LIBRARY, READ_GROUP} This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
PER_TARGET_COVERAGE (File)An optional file to output per target coverage information to. Default value: null.
PER_BASE_COVERAGE (File)An optional file to output per base coverage information to. The per-base file contains one line per target base and can grow very large. It is not recommended for use with large target sets. Default value: null.
NEAR_DISTANCE (Integer)The maximum distance between a read and the nearest probe/bait/amplicon for the read to be considered 'near probe' and included in percent selected. Default value: 250. This option can be set to 'null' to clear the default value.
MINIMUM_MAPPING_QUALITY (Integer)Minimum mapping quality for a read to contribute coverage. Default value: 1. This option can be set to 'null' to clear the default value.
MINIMUM_BASE_QUALITY (Integer)Minimum base quality for a base to contribute coverage. Default value: 0. This option can be set to 'null' to clear the default value.
CLIP_OVERLAPPING_READS (Boolean)True if we are to clip overlapping reads, false otherwise. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
COVERAGE_CAP (Integer)Parameter to set a max coverage limit for Theoretical Sensitivity calculations. Default is 200. Default value: 200. This option can be set to 'null' to clear the default value.
SAMPLE_SIZE (Integer)Sample Size used for Theoretical Het Sensitivity sampling. Default is 10000. Default value: 10000. This option can be set to 'null' to clear the default value.

CollectRnaSeqMetrics

Produces RNA alignment metrics for a SAM or BAM file.

This tool takes a SAM/BAM file containing the aligned reads from an RNAseq experiment and produces metrics describing the distribution of the bases within the transcripts. It calculates the total numbers and the fractions of nucleotides within specific genomic regions including untranslated regions (UTRs), introns, intergenic sequences (between discrete genes), and peptide-coding sequences (exons). This tool also determines the numbers of bases that pass quality filters that are specific to Illumina data (PF_BASES). For more information please see the corresponding GATK Dictionary entry.

Other metrics include the median coverage (depth), the ratios of 5 prime/3 prime biases, and the numbers of reads with the correct/incorrect strand designation. The 5 prime/3 prime bias results from errors introduced by reverse transcriptase enzymes during library construction, ultimately leading to the over-representation of either the 5 prime or 3 prime ends of transcripts. Please see the CollectRnaSeqMetrics definitions for details on how these biases are calculated.

The sequence input must be a valid SAM/BAM file containing RNAseq data aligned by an RNAseq-aware genome aligner such as STAR or TopHat. The tool also requires a REF_FLAT file, a tab-delimited file containing information about the location of RNA transcripts, exon start and stop sites, etc. For an example refFlat file for GRCh38, see refFlat.txt.gz at http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database. The first five lines of the tab-delimited text file appear as follows.

DDX11L1	NR_046018	chr1	+	11873	14409	14409	14409	3	11873,12612,13220,	12227,12721,14409,
WASH7P	NR_024540	chr1	-	14361	29370	29370	29370	11	14361,14969,15795,16606,16857,17232,17605,17914,18267,24737,29320,	14829,15038,15947,16765,17055,17368,17742,18061,18366,24891,29370,
DLGAP2-AS1	NR_103863	chr8_KI270926v1_alt	-	33083	35050	35050	35050	3	33083,33761,35028,	33281,33899,35050,
MIR570	NR_030296	chr3	+	195699400	195699497	195699497	195699497	1	195699400,	195699497,
MIR548A3	NR_030330	chr8	-	104484368	104484465	104484465	104484465	1	104484368,	104484465,
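To make the column layout of a refFlat record concrete, here is a minimal Python sketch that parses one record into named fields. The field names follow the UCSC refFlat column order (gene name, transcript name, chromosome, strand, transcript start/end, CDS start/end, exon count, exon starts, exon ends). The class and function names are hypothetical, and this is not how Picard itself reads the file.

```python
from typing import List, NamedTuple


class RefFlatRecord(NamedTuple):
    gene_name: str
    transcript_name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_count: int
    exon_starts: List[int]
    exon_ends: List[int]


def parse_int_list(field: str) -> List[int]:
    """Parse a comma-separated list that may end with a trailing comma."""
    return [int(x) for x in field.split(",") if x]


def parse_ref_flat_line(line: str) -> RefFlatRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-separated fields, got {len(fields)}")
    record = RefFlatRecord(
        fields[0], fields[1], fields[2], fields[3],
        int(fields[4]), int(fields[5]), int(fields[6]), int(fields[7]),
        int(fields[8]), parse_int_list(fields[9]), parse_int_list(fields[10]),
    )
    # Sanity check: the exon lists should match the declared exon count.
    if not (len(record.exon_starts) == len(record.exon_ends) == record.exon_count):
        raise ValueError("exon count does not match exon start/end lists")
    return record


# First record of the example, rebuilt with explicit tab separators.
example = "\t".join([
    "DDX11L1", "NR_046018", "chr1", "+", "11873", "14409", "14409", "14409",
    "3", "11873,12612,13220,", "12227,12721,14409,",
])
print(parse_ref_flat_line(example))
```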

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage example:

java -jar picard.jar CollectRnaSeqMetrics \
I=input.bam \
O=output.RNA_Metrics \
REF_FLAT=ref_flat.txt \
STRAND=SECOND_READ_TRANSCRIPTION_STRAND \
RIBOSOMAL_INTERVALS=ribosomal.interval_list
Please see the CollectRnaSeqMetrics definitions for a complete description of the metrics produced by this tool.

OptionDescription
REF_FLAT (File)Gene annotations in refFlat form. Format described here: http://genome.ucsc.edu/goldenPath/gbdDescriptionsOld.html#RefFlat Required.
RIBOSOMAL_INTERVALS (File)Location of rRNA sequences in genome, in interval_list format. If not specified no bases will be identified as being ribosomal. Format described here: Default value: null.
STRAND_SPECIFICITY (StrandSpecificity)For strand-specific library prep. For unpaired reads, use FIRST_READ_TRANSCRIPTION_STRAND if the reads are expected to be on the transcription strand. Required. Possible values: {NONE, FIRST_READ_TRANSCRIPTION_STRAND, SECOND_READ_TRANSCRIPTION_STRAND}
MINIMUM_LENGTH (Integer)When calculating coverage based values (e.g. CV of coverage) only use transcripts of this length or greater. Default value: 500. This option can be set to 'null' to clear the default value.
CHART_OUTPUT (File)The PDF file to write out a plot of normalized position vs. coverage. Default value: null.
IGNORE_SEQUENCE (String)If a read maps to a sequence specified with this option, all the bases in the read are counted as ignored bases. These reads are not counted as Default value: null. This option may be specified 0 or more times.
RRNA_FRAGMENT_PERCENTAGE (Double)This percentage of the length of a fragment must overlap one of the ribosomal intervals for a read or read pair to be considered rRNA. Default value: 0.8. This option can be set to 'null' to clear the default value.
METRIC_ACCUMULATION_LEVEL (MetricAccumulationLevel)The level(s) at which to accumulate metrics. Default value: [ALL_READS]. This option can be set to 'null' to clear the default value. Possible values: {ALL_READS, SAMPLE, LIBRARY, READ_GROUP} This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
INPUT (File)Input SAM or BAM file. Required.
OUTPUT (File)File to write the output to. Required.
ASSUME_SORTED (Boolean)If true (default), then the sort order in the header file will be ignored. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
STOP_AFTER (Long)Stop after processing N reads, mainly for debugging. Default value: 0. This option can be set to 'null' to clear the default value.

CollectRrbsMetrics

Collects metrics from reduced representation bisulfite sequencing (RRBS) data.

This tool uses reduced representation bisulfite sequencing (RRBS) data to determine cytosine methylation status across all reads of a genomic DNA sequence. For a primer on bisulfite sequencing and cytosine methylation, please see the corresponding GATK Dictionary entry.

Briefly, bisulfite treatment converts unmethylated cytosine (C) to uracil (U) bases. Methylated sites are not converted because they are resistant to bisulfite conversion. After PCR amplification of the reaction products, bisulfite conversion manifests as [C -> T (+ strand) or G -> A (- strand)] base conversions. Thus, conversion rates can be calculated from the reads as follows: [CR = converted/(converted + unconverted)]. Since methylated cytosines are protected against bisulfite-mediated conversion, the methylation rate (MR) is as follows: [MR = unconverted/(converted + unconverted) = (1 - CR)].
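The two formulas above can be checked with a short Python sketch. The function names and the example counts are hypothetical; the arithmetic simply restates CR = converted/(converted + unconverted) and MR = 1 - CR.

```python
def conversion_rate(converted: int, unconverted: int) -> float:
    """CR = converted / (converted + unconverted)."""
    total = converted + unconverted
    if total == 0:
        raise ValueError("no cytosines observed; the rate is undefined")
    return converted / total


def methylation_rate(converted: int, unconverted: int) -> float:
    """MR = unconverted / (converted + unconverted) = 1 - CR."""
    return 1.0 - conversion_rate(converted, unconverted)


# Made-up counts for a single site: 90 converted and 10 unconverted cytosines.
print(conversion_rate(90, 10))
print(methylation_rate(90, 10))
```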

The CollectRrbsMetrics tool outputs three files: summary and detail metrics tables, and a PDF file containing four graphs. These graphs are derived from the summary table and include a comparison of conversion rates for both CpG and non-CpG sites, the distribution of total numbers of CpG sites as a function of the CpG conversion rates, the distribution of CpG sites by the level of read coverage (depth), and the numbers of reads discarded because they either exceeded the mismatch rate or were too short. The detailed metrics table includes the coordinates of all of the CpG sites for the experiment as well as the conversion rates observed for each site.

Usage example:

java -jar picard.jar CollectRrbsMetrics \
R=reference_sequence.fasta \
I=input.bam \
M=basename_for_metrics_files

Please see the CollectRrbsMetrics definitions for a complete description of both the detail and summary metrics produced by this tool.


OptionDescription
INPUT (File)The BAM or SAM file containing aligned reads. Must be coordinate sorted Required.
METRICS_FILE_PREFIX (String)Base name for output files Required.
REFERENCE (File)The reference sequence fasta file Required.
MINIMUM_READ_LENGTH (Integer)Minimum read length Default value: 5. This option can be set to 'null' to clear the default value.
C_QUALITY_THRESHOLD (Integer)Threshold for base quality of a C base before it is considered Default value: 20. This option can be set to 'null' to clear the default value.
NEXT_BASE_QUALITY_THRESHOLD (Integer)Threshold for quality of a base next to a C before the C base is considered Default value: 10. This option can be set to 'null' to clear the default value.
MAX_MISMATCH_RATE (Double)Maximum percentage of mismatches in a read for it to be considered, with a range of 0-1 Default value: 0.1. This option can be set to 'null' to clear the default value.
SEQUENCE_NAMES (String)Set of sequence names to consider, if not specified all sequences will be used Default value: null. This option may be specified 0 or more times.
ASSUME_SORTED (Boolean)If true, assume that the input file is coordinate sorted even if the header says otherwise. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
METRIC_ACCUMULATION_LEVEL (MetricAccumulationLevel)The level(s) at which to accumulate metrics. Default value: [ALL_READS]. This option can be set to 'null' to clear the default value. Possible values: {ALL_READS, SAMPLE, LIBRARY, READ_GROUP} This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.

CollectSequencingArtifactMetrics

Collect metrics to quantify single-base sequencing artifacts.

This tool examines two sources of sequencing errors associated with hybrid selection protocols. These errors are divided into two broad categories: pre-adapter and bait-bias. Pre-adapter errors can arise from laboratory manipulations of a nucleic acid sample (e.g. shearing) and occur prior to the ligation of adapters for PCR amplification (hence the name pre-adapter).

Bait-bias artifacts occur during or after the target selection step, and correlate with substitution rates that are 'biased', or higher for sites having one base on the reference/positive strand relative to sites having the complementary base on that strand. For example, during the target selection step, a (G>T) artifact might result in a higher substitution rate at sites with a G on the positive strand (and C on the negative), relative to sites with the flip (C positive)/(G negative). This is known as the 'G-Ref' artifact.

For additional information on these types of artifacts, please see the corresponding GATK dictionary entries on bait-bias and pre-adapter artifacts.

This tool produces four files: summary and detail metrics files for both pre-adapter and bait-bias artifacts. The detailed metrics show the error rates for each type of base substitution within every possible triplet base configuration. Error rates associated with these substitutions are Phred-scaled and provided as quality scores; the lower the value, the more likely it is that an alternate base call is due to an artifact. The summary metrics provide likelihood information on the 'worst-case' errors.
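For readers unfamiliar with Phred scaling, the sketch below shows the standard relationship Q = -10 * log10(error rate) and its inverse. The function names are hypothetical, and the sketch does not model how the tool handles an error rate of zero, which is not stated here.

```python
import math


def phred_from_error_rate(error_rate: float) -> float:
    """Standard Phred scaling: Q = -10 * log10(error rate)."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error rate must be in the interval (0, 1]")
    return -10.0 * math.log10(error_rate)


def error_rate_from_phred(q: float) -> float:
    """Inverse transform: error rate = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# A lower Q corresponds to a higher error rate, i.e. a more likely artifact.
for rate in (0.1, 0.01, 0.001):
    print(rate, round(phred_from_error_rate(rate), 2))
```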

Usage example:

java -jar picard.jar CollectSequencingArtifactMetrics \
I=input.bam \
O=artifact_metrics.txt \
R=reference_sequence.fasta
Please see the metrics at the following links PreAdapterDetailMetrics, PreAdapterSummaryMetrics, BaitBiasDetailMetrics, and BaitBiasSummaryMetrics for complete descriptions of the output metrics produced by this tool.

OptionDescription
INTERVALS (File)An optional list of intervals to restrict analysis to. Default value: null.
DB_SNP (File)VCF format dbSNP file, used to exclude regions around known polymorphisms from analysis. Default value: null.
MINIMUM_QUALITY_SCORE (Integer)The minimum base quality score for a base to be included in analysis. Default value: 20. This option can be set to 'null' to clear the default value.
MINIMUM_MAPPING_QUALITY (Integer)The minimum mapping quality score for a base to be included in analysis. Default value: 30. This option can be set to 'null' to clear the default value.
MINIMUM_INSERT_SIZE (Integer)The minimum insert size for a read to be included in analysis. Default value: 60. This option can be set to 'null' to clear the default value.
MAXIMUM_INSERT_SIZE (Integer)The maximum insert size for a read to be included in analysis. Set to 0 to have no maximum. Default value: 600. This option can be set to 'null' to clear the default value.
INCLUDE_UNPAIRED (Boolean)Include unpaired reads. If set to true then all paired reads will be included as well - MINIMUM_INSERT_SIZE and MAXIMUM_INSERT_SIZE will be ignored. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INCLUDE_DUPLICATES (Boolean)Include duplicate reads. If set to true then all reads flagged as duplicates will be included as well. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INCLUDE_NON_PF_READS (Boolean)Whether or not to include non-PF reads. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
TANDEM_READS (Boolean)Set to true if mate pairs are being sequenced from the same strand, i.e. they're expected to face the same direction. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
USE_OQ (Boolean)When available, use original quality scores for filtering. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
CONTEXT_SIZE (Integer)The number of context bases to include on each side of the assayed base. Default value: 1. This option can be set to 'null' to clear the default value.
CONTEXTS_TO_PRINT (String)If specified, only print results for these contexts in the detail metrics output. However, the summary metrics output will still take all contexts into consideration. Default value: null. This option may be specified 0 or more times.
FILE_EXTENSION (String)Append the given file extension to all metric file names (ex. OUTPUT.pre_adapter_summary_metrics.EXT). None if null Default value: null.
INPUT (File)Input SAM or BAM file. Required.
OUTPUT (File)File to write the output to. Required.
ASSUME_SORTED (Boolean)If true (default), then the sort order in the header file will be ignored. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
STOP_AFTER (Long)Stop after processing N reads, mainly for debugging. Default value: 0. This option can be set to 'null' to clear the default value.

CollectVariantCallingMetrics

Collects per-sample and aggregate (spanning all samples) metrics from the provided VCF file.

OptionDescription
INPUT (File)Input vcf file for analysis Required.
OUTPUT (File)Path (except for the file extension) of output metrics files to write. Required.
DBSNP (File)Reference dbSNP file in dbSNP or VCF format. Required.
TARGET_INTERVALS (File)Target intervals to restrict analysis to. Default value: null.
SEQUENCE_DICTIONARY (File)If present, speeds loading of dbSNP file, will look for dictionary in vcf if not present here. Default value: null.
GVCF_INPUT (Boolean)Set to true if running on a single-sample gvcf. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
THREAD_COUNT (Integer)Default value: 1. This option can be set to 'null' to clear the default value.

CollectWgsMetrics

Collect metrics about coverage and performance of whole genome sequencing (WGS) experiments.

This tool collects metrics about the fractions of reads that pass base- and mapping-quality filters as well as coverage (read-depth) levels for WGS analyses. Both minimum base- and mapping-quality values as well as the maximum read depths (coverage cap) are user defined.
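The effect of a coverage cap can be sketched in a few lines of Python. This is a simplified illustration rather than the tool's implementation: the function name is hypothetical, the input is a plain list of per-position depths, and the denominator used for the excluded fraction (all observed bases) is an assumption. The default of 250 mirrors the COVERAGE_CAP default listed in the options table.

```python
from typing import List, Tuple


def capped_coverage_summary(depths: List[int], coverage_cap: int = 250) -> Tuple[float, float]:
    """Return (mean capped depth, fraction of bases excluded by the cap).

    Positions deeper than coverage_cap are treated as having exactly
    coverage_cap; the excess is tallied separately.
    """
    if not depths:
        raise ValueError("at least one position is required")
    capped = [min(d, coverage_cap) for d in depths]
    total_bases = sum(depths)
    excess_bases = sum(d - c for d, c in zip(depths, capped))
    mean_capped = sum(capped) / len(depths)
    # Assumption: the excluded fraction is taken over all observed bases.
    fraction_excluded = excess_bases / total_bases if total_bases else 0.0
    return mean_capped, fraction_excluded


# Three positions with made-up depths; only the third exceeds the cap.
print(capped_coverage_summary([10, 20, 300], coverage_cap=250))
```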

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage Example:

java -jar picard.jar CollectWgsMetrics \
I=input.bam \
O=collect_wgs_metrics.txt \
R=reference_sequence.fasta
Please see CollectWgsMetrics for detailed explanations of the output metrics.

OptionDescription
INPUT (File)Input SAM or BAM file. Required.
OUTPUT (File)Output metrics file. Required.
REFERENCE_SEQUENCE (File)The reference sequence fasta aligned to. Required.
MINIMUM_MAPPING_QUALITY (Integer)Minimum mapping quality for a read to contribute coverage. Default value: 20. This option can be set to 'null' to clear the default value.
MINIMUM_BASE_QUALITY (Integer)Minimum base quality for a base to contribute coverage. N bases will be treated as having a base quality of negative infinity and will therefore be excluded from coverage regardless of the value of this parameter. Default value: 20. This option can be set to 'null' to clear the default value.
COVERAGE_CAP (Integer)Treat positions with coverage exceeding this value as if they had coverage at this value (but calculate the difference for PCT_EXC_CAPPED). Default value: 250. This option can be set to 'null' to clear the default value.
LOCUS_ACCUMULATION_CAP (Integer)At positions with coverage exceeding this value, completely ignore reads that accumulate beyond this value (so that they will not be considered for PCT_EXC_CAPPED). Used to keep memory consumption in check, but could create bias if set too low Default value: 100000. This option can be set to 'null' to clear the default value.
STOP_AFTER (Long)For debugging purposes, stop after processing this many genomic bases. Default value: -1. This option can be set to 'null' to clear the default value.
INCLUDE_BQ_HISTOGRAM (Boolean)Determines whether to include the base quality histogram in the metrics file. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
COUNT_UNPAIRED (Boolean)If true, count unpaired reads, and paired reads with one end unmapped Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
SAMPLE_SIZE (Integer)Sample Size used for Theoretical Het Sensitivity sampling. Default is 10000. Default value: 10000. This option can be set to 'null' to clear the default value.
USE_FAST_ALGORITHM (Boolean)If true, fast algorithm is used. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
READ_LENGTH (Integer)Average read length in the file. Default is 150. Default value: 150. This option can be set to 'null' to clear the default value.
INTERVALS (File)An interval list file that contains the positions to restrict the assessment. Please note that all bases of reads that overlap these intervals will be considered, even if some of those bases extend beyond the boundaries of the interval. The ideal use case for this argument is to use it to restrict the calculation to a subset of (whole) contigs. Default value: null.

CollectWgsMetricsWithNonZeroCoverage

Collect metrics about coverage and performance of whole genome sequencing (WGS) experiments. This tool collects metrics about the percentages of reads that pass base- and mapping-quality filters as well as coverage (read-depth) levels. Both minimum base- and mapping-quality values as well as the maximum read depths (coverage cap) are user defined. This extends CollectWgsMetrics by including metrics related only to sites with non-zero (>0) coverage.

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage Example:

java -jar picard.jar CollectWgsMetricsWithNonZeroCoverage \
I=input.bam \
O=collect_wgs_metrics.txt \
CHART=collect_wgs_metrics.pdf \
R=reference_sequence.fasta
Please see the WgsMetricsWithNonZeroCoverage documentation for detailed explanations of the output metrics.

OptionDescription
CHART_OUTPUT (File)A file (with .pdf extension) to write the chart to. Required.
INPUT (File)Input SAM or BAM file. Required.
OUTPUT (File)Output metrics file. Required.
REFERENCE_SEQUENCE (File)The reference sequence fasta aligned to. Required.
MINIMUM_MAPPING_QUALITY (Integer)Minimum mapping quality for a read to contribute coverage. Default value: 20. This option can be set to 'null' to clear the default value.
MINIMUM_BASE_QUALITY (Integer)Minimum base quality for a base to contribute coverage. N bases will be treated as having a base quality of negative infinity and will therefore be excluded from coverage regardless of the value of this parameter. Default value: 20. This option can be set to 'null' to clear the default value.
COVERAGE_CAP (Integer)Treat positions with coverage exceeding this value as if they had coverage at this value (but calculate the difference for PCT_EXC_CAPPED). Default value: 250. This option can be set to 'null' to clear the default value.
LOCUS_ACCUMULATION_CAP (Integer)At positions with coverage exceeding this value, completely ignore reads that accumulate beyond this value (so that they will not be considered for PCT_EXC_CAPPED). Used to keep memory consumption in check, but could create bias if set too low Default value: 100000. This option can be set to 'null' to clear the default value.
STOP_AFTER (Long)For debugging purposes, stop after processing this many genomic bases. Default value: -1. This option can be set to 'null' to clear the default value.
INCLUDE_BQ_HISTOGRAM (Boolean)Determines whether to include the base quality histogram in the metrics file. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
COUNT_UNPAIRED (Boolean)If true, count unpaired reads, and paired reads with one end unmapped Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
SAMPLE_SIZE (Integer)Sample Size used for Theoretical Het Sensitivity sampling. Default is 10000. Default value: 10000. This option can be set to 'null' to clear the default value.
USE_FAST_ALGORITHM (Boolean)If true, fast algorithm is used. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
READ_LENGTH (Integer)Average read length in the file. Default is 150. Default value: 150. This option can be set to 'null' to clear the default value.
INTERVALS (File)An interval list file that contains the positions to restrict the assessment. Please note that all bases of reads that overlap these intervals will be considered, even if some of those bases extend beyond the boundaries of the interval. The ideal use case for this argument is to use it to restrict the calculation to a subset of (whole) contigs. Default value: null.

CompareMetrics

Compare two metrics files. This tool compares the metrics and histograms generated from metric tools to determine if the generated results are identical. This tool is useful to test and compare outputs when code changes are implemented. It is not meant for use by end-users of this toolkit.

The tool's output simply indicates whether two metrics files are equal or not equal.

Usage example:

java -jar picard.jar CompareMetrics \
metricfile1.txt \
metricfile2.txt

CompareSAMs

Compare two input ".sam" or ".bam" files. This tool initially compares the headers of SAM or BAM files. If the file headers are comparable, the tool will examine and compare readUnmapped flag, reference name, start position and strand between the SAMRecords. The tool summarizes information on the number of read pairs that match or mismatch, and of reads that are missing or unmapped (stratified by direction: forward or reverse).

Usage example:

java -jar picard.jar CompareSAMs \
file_1.bam \
file_2.bam

ConvertSequencingArtifactToOxoG

Extract OxoG metrics from generalized artifacts metrics.

This tool extracts 8-oxoguanine (OxoG) artifact metrics from the output of CollectSequencingArtifactMetrics (a tool that provides detailed information on a variety of artifacts found in sequencing libraries) and converts them to the CollectOxoGMetrics tool's output format. This conveniently eliminates the need to run CollectOxoGMetrics if CollectSequencingArtifactMetrics has already been run in the pipeline. See the documentation for CollectSequencingArtifactMetrics and CollectOxoGMetrics for additional information on these tools.

Note that only the base of the CollectSequencingArtifactMetrics output file name is required for the (INPUT_BASE) parameter. For example, if the file name is artifact_metrics.txt.bait_bias_detail_metrics or artifact_metrics.txt.pre_adapter_detail_metrics, only the file name base 'artifact_metrics' is required on the command line for this parameter. An output file called 'artifact_metrics.oxog_metrics' will be generated automatically. Finally, to run this tool successfully, the REFERENCE_SEQUENCE must be provided.

Usage example:

java -jar picard.jar ConvertSequencingArtifactToOxoG \
I=artifact_metrics \
R=reference.fasta
Please see the metrics definitions page at ConvertSequencingArtifactToOxoG for detailed descriptions of the output metrics produced by this tool.

OptionDescription
INPUT_BASE (File)Basename of the input artifact metrics file (output by CollectSequencingArtifactMetrics) Required.
OUTPUT_BASE (File)Basename for output OxoG metrics. Defaults to same basename as input metrics Default value: null.

CreateSequenceDictionary

Creates a sequence dictionary for a reference sequence. This tool creates a sequence dictionary file (with ".dict" extension) from a reference sequence provided in FASTA format, which is required by many processing and analysis tools. The output file contains a header but no SAMRecords, and the header contains only sequence records.

The reference sequence can be gzipped (both .fasta and .fasta.gz are supported).
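To illustrate what a sequence record contains, the following Python sketch derives the sequence name (SN) and length (LN) for each entry of a FASTA text and formats them as @SQ header lines. It is a simplified illustration, not the tool's implementation: the function names are hypothetical, and a real dictionary file also carries an @HD line and may carry further tags that are omitted here. The `truncate_names_at_whitespace` parameter mirrors the behavior described for the TRUNCATE_NAMES_AT_WHITESPACE option.

```python
from typing import List, Tuple


def sequence_records(fasta_text: str, truncate_names_at_whitespace: bool = True) -> List[Tuple[str, int]]:
    """Return (name, length) for each sequence in a FASTA-formatted string."""
    records: List[Tuple[str, int]] = []
    name = None
    length = 0
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, length))
            header = line[1:].strip()
            if not header:
                raise ValueError("FASTA header line has no sequence name")
            # Optionally keep only the first word of the header line.
            name = header.split()[0] if truncate_names_at_whitespace else header
            length = 0
        else:
            if name is None:
                raise ValueError("sequence data found before any header line")
            length += len(line)
    if name is not None:
        records.append((name, length))
    return records


def format_sq_lines(records: List[Tuple[str, int]]) -> List[str]:
    """Format records as tab-separated @SQ lines with SN and LN tags only."""
    return ["@SQ\tSN:%s\tLN:%d" % (name, length) for name, length in records]


# Tiny made-up reference with two sequences.
fasta = ">chr1 first contig\nACGT\nAC\n>chr2\nGG\n"
for sq_line in format_sq_lines(sequence_records(fasta)):
    print(sq_line)
```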

Usage example:

java -jar picard.jar CreateSequenceDictionary \
R=reference.fasta \
O=reference.dict

OptionDescription
REFERENCE (File)Input reference fasta or fasta.gz Required.
OUTPUT (File)Output SAM file containing only the sequence dictionary. By default it will use the base name of the input reference with the .dict extension Default value: null.
GENOME_ASSEMBLY (String)Put into AS field of sequence dictionary entry if supplied Default value: null.
URI (String)Put into UR field of sequence dictionary entry. If not supplied, input reference file is used Default value: null.
SPECIES (String)Put into SP field of sequence dictionary entry Default value: null.
TRUNCATE_NAMES_AT_WHITESPACE (Boolean)Make sequence name the first word from the > line in the fasta file. By default the entire contents of the > line is used, excluding leading and trailing whitespace. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
NUM_SEQUENCES (Integer)Stop after writing this many sequences. For testing. Default value: 2147483647. This option can be set to 'null' to clear the default value.

DownsampleSam

Downsample a SAM or BAM file. This tool applies a random downsampling algorithm to a SAM or BAM file to retain only a random subset of the reads. Reads in a mate-pair are either both kept or both discarded. Reads marked as not primary alignments are all discarded. Each read is given a probability P of being retained, and runs performed with the exact same input in the same order and with the same value for RANDOM_SEED will produce the same results.

All reads for a template are kept or discarded as a unit, with the goal of retaining reads from PROBABILITY * input templates. While this will usually result in approximately PROBABILITY * input reads being retained also, for very small PROBABILITIES this may not be the case.

A number of different downsampling strategies are supported using the STRATEGY option:

ConstantMemory: Downsamples a stream or file of SAMRecords using a hash-projection strategy such that it can run in constant memory. The downsampling is stochastic, and therefore the actual retained proportion will vary around the requested proportion. Because it works in fixed memory, this strategy is good for large inputs. Because it is stochastic, its accuracy is highest with a high number of output records and diminishes at low output volumes.

HighAccuracy: Attempts (but does not guarantee) to provide accuracy up to a specified limit. Accuracy is defined as emitting a proportion of reads as close to the requested proportion as possible. In order to do so this strategy requires memory that is proportional to the number of template names in the incoming stream of reads, and will thus require large amounts of memory when running on large input files.

Chained: Attempts to provide a compromise strategy that offers some of the advantages of both the ConstantMemory and HighAccuracy strategies. Uses a ConstantMemory strategy to downsample the incoming stream to approximately the desired proportion, and then a HighAccuracy strategy to finish. Works in a single pass, and will provide accuracy close to (but often not as good as) HighAccuracy while requiring memory proportional to the set of reads emitted from the ConstantMemory strategy to the HighAccuracy strategy. Works well when downsampling large inputs to small proportions (e.g. downsampling hundreds of millions of reads and retaining only 2%). Should be accurate 99.9% of the time when the input contains >= 50,000 templates (read names). For smaller inputs, HighAccuracy is recommended instead.
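The idea behind a hash-projection strategy can be sketched as follows: each template name, combined with the seed, is hashed to a number, and the template is kept when that number falls below a threshold proportional to the requested probability. Because the decision depends only on the name and the seed, mates that share a read name are kept or discarded together, repeated runs give the same result, and no per-read state has to be stored. The function names and the use of MD5 are assumptions for this illustration; this is not Picard's actual hash function or code.

```python
import hashlib
from typing import List


def keep_template(read_name: str, probability: float, seed: int = 1) -> bool:
    """Decide deterministically whether to keep every read of a template."""
    digest = hashlib.md5(f"{seed}:{read_name}".encode("utf-8")).digest()
    # Project the first 8 bytes of the hash onto the integer range [0, 2**64).
    value = int.from_bytes(digest[:8], "big")
    return value < probability * 2 ** 64


def downsample(read_names: List[str], probability: float, seed: int = 1) -> List[str]:
    """Filter a stream of read names without remembering earlier decisions."""
    return [name for name in read_names if keep_template(name, probability, seed)]


# Made-up read names; each name appears twice to mimic the two mates of a pair.
names = [f"read{i}" for i in range(1000) for _ in range(2)]
kept = downsample(names, probability=0.1)
print(len(kept), "of", len(names), "reads retained")
```

As the description of ConstantMemory notes, the retained proportion varies around the requested one, so the count printed above will be close to, but not exactly, 10% of the input.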

Usage example:

java -jar picard.jar DownsampleSam \
I=input.bam \
O=downsampled.bam

Option: Description
INPUT (File): The input SAM or BAM file to downsample. Required.
OUTPUT (File): The output, downsampled, SAM or BAM file to write. Required.
STRATEGY (Strategy): The downsampling strategy to use. See usage for discussion. Default value: ConstantMemory. This option can be set to 'null' to clear the default value. Possible values: {HighAccuracy, ConstantMemory, Chained}
RANDOM_SEED (Integer): Random seed to use if deterministic behavior is desired. Setting to null will cause multiple invocations to produce different results. Default value: 1. This option can be set to 'null' to clear the default value.
PROBABILITY (Double): The probability of keeping any individual read, between 0 and 1. Default value: 1.0. This option can be set to 'null' to clear the default value.
ACCURACY (Double): The accuracy that the downsampler should try to achieve if the selected strategy supports it. Note that accuracy is never guaranteed, but some strategies will attempt to provide accuracy within the requested bounds. Higher accuracy will generally require more memory. Default value: 1.0E-4. This option can be set to 'null' to clear the default value.
METRICS_FILE (File): The file to write metrics to (QualityYieldMetrics). Default value: null.

ExtractIlluminaBarcodes

Determines the barcode for each read in an Illumina lane.

This tool determines the numbers of reads containing barcode-matching sequences and provides statistics on the quality of these barcode matches.

Illumina sequences can contain at least two types of barcodes, sample and molecular (index). Sample barcodes (B in the read structure) are used to demultiplex pooled samples while index barcodes (M in the read structure) are used to differentiate multiple reads of a template when carrying out paired-end sequencing. Note that this tool only extracts sample (B) and not molecular barcodes (M).

Barcodes can be provided in the form of a list (BARCODE_FILE) or a string representing the barcode (BARCODE). The BARCODE_FILE contains multiple fields including 'barcode_sequence_1', 'barcode_sequence_2' (optional), 'barcode_name', and 'library_name'. In contrast, the BARCODE argument is used for runs with reads containing a single barcode (nonmultiplexed) and can be added directly as a string of text e.g. BARCODE=CAATAGCG.

Data is output per lane/tile within the BaseCalls directory with the file name format of 's_{lane}_{tile}_barcode.txt'. These files contain the following tab-separated columns:

  • Read subsequence at barcode position
  • Y or N indicating if there was a barcode match
  • Matched barcode sequence (empty if read did not match one of the barcodes)
  • The number of mismatches if there was a barcode match
  • The number of mismatches to the second best barcode if there was a barcode match

If there is no match but the read is close to the threshold for calling a match, the barcode that would have been matched is output in lower case. Threshold values can be adjusted to accommodate barcode sequence mismatches from the reads.

The metrics file produced by the ExtractIlluminaBarcodes program indicates the number of matches (and mismatches) between the barcode reads and the actual barcodes. These metrics are provided both per-barcode and per lane and can be found in the BaseCalls directory.

For poorly matching barcodes, the order of specification of barcodes can cause arbitrary output differences.

Usage example:

java -jar picard.jar ExtractIlluminaBarcodes \
BASECALLS_DIR=/BaseCalls/ \
LANE=1 \
READ_STRUCTURE=25T8B25T \
BARCODE_FILE=barcodes.txt \
METRICS_FILE=metrics_output.txt
Please see the ExtractIlluminaBarcodes.BarcodeMetric definitions for a complete description of the metrics produced by this tool.
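
The READ_STRUCTURE value used in the example above (25T8B25T) is a sequence of integer/character pairs. The following Python sketch shows one way such a string could be parsed and applied to the bases of a cluster. The function names parse_read_structure and split_cluster are hypothetical and this is not Picard's parser; it only illustrates the notation described for the READ_STRUCTURE option.

```python
import re

def parse_read_structure(structure: str):
    """Return a list of (cycle_count, cycle_type) pairs, e.g. [(25, 'T'), (8, 'B'), (25, 'T')].

    T = template, B = sample barcode, M = molecular barcode, S = skip.
    """
    if not re.fullmatch(r"(\d+[TBMS])+", structure):
        raise ValueError(f"Invalid read structure: {structure!r}")
    return [(int(count), kind) for count, kind in re.findall(r"(\d+)([TBMS])", structure)]

def split_cluster(bases: str, structure: str):
    """Split the bases of one cluster into (type, bases) segments, dropping skipped cycles."""
    segments = []
    position = 0
    for count, kind in parse_read_structure(structure):
        if kind != "S":
            segments.append((kind, bases[position:position + count]))
        position += count
    if position != len(bases):
        raise ValueError("Read structure does not cover the cluster exactly")
    return segments

print(parse_read_structure("25T8B25T"))
```

For the 80-cycle structure 28T8M8B8S28T, split_cluster returns four segments (template, molecular barcode, sample barcode, template) and omits the eight skipped cycles.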


Option: Description
BASECALLS_DIR (File): The Illumina basecalls directory. Required.
OUTPUT_DIR (File): Where to write _barcode.txt files. By default, these are written to BASECALLS_DIR. Default value: null.
LANE (Integer): Lane number. Required.
READ_STRUCTURE (String): A description of the logical structure of clusters in an Illumina Run, i.e. a description of the structure IlluminaBasecallsToSam assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for Sample Barcode, M for molecular barcode, T for Template, and S for skip). E.g. if the input data consists of 80 base clusters and we provide a read structure of "28T8M8B8S28T" then the sequence may be split up into four reads: read one with 28 cycles (bases) of template; read two with 8 cycles (bases) of molecular barcode (e.g. a unique molecular barcode); read three with 8 cycles (bases) of sample barcode; 8 cycles (bases) skipped; and read four with 28 cycles (bases) of template. The skipped cycles would NOT be included in an output SAM/BAM file or in read groups therein. Required.
BARCODE (String): Barcode sequence. These must be unique, and all the same length. This cannot be used with reads that have more than one barcode; use BARCODE_FILE in that case. Default value: null. This option may be specified 0 or more times. Cannot be used in conjunction with option(s) BARCODE_FILE.
BARCODE_FILE (File): Tab-delimited file of barcode sequences, barcode name and, optionally, library name. Barcodes must be unique and all the same length. Column headers must be 'barcode_sequence_1', 'barcode_sequence_2' (optional), 'barcode_name', and 'library_name'. Required. Cannot be used in conjunction with option(s) BARCODE.
METRICS_FILE (File): Per-barcode and per-lane metrics written to this file. Required.
MAX_MISMATCHES (Integer): Maximum mismatches for a barcode to be considered a match. Default value: 1. This option can be set to 'null' to clear the default value.
MIN_MISMATCH_DELTA (Integer): Minimum difference between number of mismatches in the best and second best barcodes for a barcode to be considered a match. Default value: 1. This option can be set to 'null' to clear the default value.
MAX_NO_CALLS (Integer): Maximum allowable number of no-calls in a barcode read before it is considered unmatchable. Default value: 2. This option can be set to 'null' to clear the default value.
MINIMUM_BASE_QUALITY (Integer): Minimum base quality. Any barcode bases falling below this quality will be considered a mismatch even if the bases match. Default value: 0. This option can be set to 'null' to clear the default value.
MINIMUM_QUALITY (Integer): The minimum quality (after transforming 0s to 1s) expected from reads. If qualities are lower than this value, an error is thrown. The default of 2 is what Illumina's spec describes as the minimum, but in practice lower values have been observed. Default value: 2. This option can be set to 'null' to clear the default value.
COMPRESS_OUTPUTS (Boolean): Compress output s_l_t_barcode.txt files using gzip and append a .gz extension to the file names. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
NUM_PROCESSORS (Integer): Run this many PerTileBarcodeExtractors in parallel. If NUM_PROCESSORS = 0, the number of cores is automatically set to the number of cores available on the machine. If NUM_PROCESSORS < 0, the number of cores used will be the number available on the machine less the absolute value of NUM_PROCESSORS. Default value: 1. This option can be set to 'null' to clear the default value.
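
The interaction of MAX_MISMATCHES, MIN_MISMATCH_DELTA, MAX_NO_CALLS and MINIMUM_BASE_QUALITY can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: the function names are hypothetical, no-calls are assumed to be written as 'N' or '.', and no-call positions are assumed to be skipped rather than counted as mismatches.

```python
NO_CALLS = "N."  # assumption: characters that represent a no-call

def count_mismatches(barcode, read_bases, qualities=None, min_base_quality=0):
    """Count mismatching positions; low-quality bases count as mismatches even if they agree."""
    mismatches = 0
    for i, (expected, observed) in enumerate(zip(barcode, read_bases)):
        if observed in NO_CALLS:
            continue  # assumption: no-calls are not counted as mismatches
        if observed != expected:
            mismatches += 1
        elif qualities is not None and qualities[i] < min_base_quality:
            mismatches += 1
    return mismatches

def match_barcode(read_bases, barcodes, qualities=None, max_mismatches=1,
                  min_mismatch_delta=1, max_no_calls=2, min_base_quality=0):
    """Return (matched, best_barcode, best_mismatches, second_best_mismatches)."""
    no_calls = sum(base in NO_CALLS for base in read_bases)
    scored = sorted(
        ((count_mismatches(b, read_bases, qualities, min_base_quality), b) for b in barcodes),
        key=lambda item: item[0],  # stable sort: on ties the first barcode listed wins
    )
    best_mismatches, best = scored[0]
    second = scored[1][0] if len(scored) > 1 else len(read_bases)
    matched = (no_calls <= max_no_calls
               and best_mismatches <= max_mismatches
               and second - best_mismatches >= min_mismatch_delta)
    return matched, best, best_mismatches, second

print(match_barcode("CAATAGCC", ["CAATAGCG", "TTGCCATA"]))
```

With the defaults, a read that is one mismatch away from two different barcodes is not called, because the difference between best and second best is 0. Which of the tied barcodes is reported as "best" then depends only on the order in which they were specified.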

EstimateLibraryComplexity

Estimates the numbers of unique molecules in a sequencing library.

This tool outputs quality metrics for a sequencing library preparation. Library complexity refers to the number of unique DNA fragments present in a given library. Reductions in complexity resulting from PCR amplification during library preparation will ultimately compromise downstream analyses via an elevation in the number of duplicate reads. PCR-associated duplication artifacts can result from: inadequate amounts of starting material (genomic DNA, cDNA, etc.), losses during cleanups, and size selection issues. Duplicate reads can also arise from optical duplicates resulting from sequencing-machine optical sensor artifacts.

This tool attempts to estimate library complexity from the sequences of read pairs alone. Reads are sorted by the first N bases (5 by default) of the first read and then the first N bases of the second read of a pair. Read pairs are considered to be duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to MAX_DIFF_RATE (0.03 by default). Reads of poor quality are filtered out to provide a more accurate estimate. The filtering removes read pairs in which the mean base quality of either the first or the second read falls below MIN_MEAN_QUALITY (20 is the default value). Unpaired reads are ignored in this computation.

The algorithm attempts to detect optical duplicates separately from PCR duplicates and excludes these in the calculation of library size. Also, since there is no alignment information used in this algorithm, an additional filter is applied to the data as follows. After examining all reads, a histogram is built in which the number of reads in a duplicate set is compared with the number of duplicate sets. All bins that contain exactly one duplicate set are then removed from the histogram as outliers prior to the library size estimation.
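
The grouping and histogram steps described above can be sketched in Python. This is a simplified illustration, not the tool's algorithm: it skips quality filtering and optical duplicate detection, compares each pair only against the first member of an existing duplicate set, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def mismatch_rate(a: str, b: str) -> float:
    """Ungapped mismatch rate over the shared length of two sequences."""
    length = min(len(a), len(b))
    if length == 0:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / length

def duplicate_set_sizes(pairs, min_identical_bases=5, max_diff_rate=0.03):
    """pairs: iterable of (read1, read2) sequences. Returns the size of every duplicate set."""
    buckets = defaultdict(list)
    for read1, read2 in pairs:
        # Only pairs sharing the first N bases of both reads are compared with each other.
        key = (read1[:min_identical_bases], read2[:min_identical_bases])
        buckets[key].append(read1 + read2)
    sizes = []
    for candidates in buckets.values():
        duplicate_sets = []
        for sequence in candidates:
            for members in duplicate_sets:
                if mismatch_rate(sequence, members[0]) <= max_diff_rate:
                    members.append(sequence)
                    break
            else:
                duplicate_sets.append([sequence])
        sizes.extend(len(members) for members in duplicate_sets)
    return sizes

def size_histogram(sizes, min_group_count=2):
    """Histogram of set size -> number of sets, dropping bins with fewer than min_group_count sets."""
    histogram = Counter(sizes)
    return {size: count for size, count in histogram.items() if count >= min_group_count}
```

With min_group_count=2, a bin that contains exactly one duplicate set (for example a single set of 500 reads) is dropped, which matches the outlier removal described above.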

Usage example:

java -jar picard.jar EstimateLibraryComplexity \
I=input.bam \
O=est_lib_complex_metrics.txt
Please see the documentation for the companion MarkDuplicates tool.

Option: Description
INPUT (File): One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped. Default value: null. This option may be specified 0 or more times.
OUTPUT (File): Output file to write per-library metrics to. Required.
MIN_IDENTICAL_BASES (Integer): The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^max_id_bases reads will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU. Default value: 5. This option can be set to 'null' to clear the default value.
MAX_DIFF_RATE (Double): The maximum rate of differences between two reads to call them identical. Default value: 0.03. This option can be set to 'null' to clear the default value.
MIN_MEAN_QUALITY (Integer): The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations. Default value: 20. This option can be set to 'null' to clear the default value.
MAX_GROUP_RATIO (Integer): Do not process self-similar groups that are this many times over the mean expected group size. I.e. if the input contains 10 million read pairs and MIN_IDENTICAL_BASES is set to 5, then the mean expected group size would be approximately 10 reads. Default value: 500. This option can be set to 'null' to clear the default value.
BARCODE_TAG (String): Barcode SAM tag (e.g. BC for 10X Genomics). Default value: null.
READ_ONE_BARCODE_TAG (String): Read one barcode SAM tag (e.g. BX for 10X Genomics). Default value: null.
READ_TWO_BARCODE_TAG (String): Read two barcode SAM tag (e.g. BX for 10X Genomics). Default value: null.
MAX_READ_LENGTH (Integer): The maximum number of bases to consider when comparing reads (0 means no maximum). Default value: 0. This option can be set to 'null' to clear the default value.
MIN_GROUP_COUNT (Integer): Minimum group count. On a per-library basis, we count the number of groups of duplicates that have a particular size. Omit from consideration any count that is less than this value. For example, if we see only one group of duplicates with size 500, we omit it from the metric calculations if MIN_GROUP_COUNT is set to two. Setting this to two may help remove technical artifacts from the library size calculation, for example, adapter dimers. Default value: 2. This option can be set to 'null' to clear the default value.
READ_NAME_REGEX (String): Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate counts, library size estimation will be inaccurate. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on the colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values. Default value: . This option can be set to 'null' to clear the default value.
OPTICAL_DUPLICATE_PIXEL_DISTANCE (Integer): The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best. Default value: 100. This option can be set to 'null' to clear the default value.
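
Two of the option descriptions above involve small computations that a Python sketch can make concrete. The first function reproduces the "approximately 10 reads" figure from MAX_GROUP_RATIO. That figure works out when the grouping key is taken to be N bases from each of the two reads, i.e. 4^(2N) possible keys, which is an interpretation rather than something the text states. The second and third functions follow the colon-splitting rule from READ_NAME_REGEX. Comparing x and y offsets separately, rather than a Euclidean distance, is also an assumption, and all function names are hypothetical.

```python
def expected_group_size(read_pairs: int, min_identical_bases: int = 5) -> float:
    """Mean number of read pairs sharing a key of N bases from each read of the pair."""
    return read_pairs / 4 ** (2 * min_identical_bases)

def parse_location(read_name: str):
    """Extract (tile, x, y) from a colon-separated read name with 5 or 7 elements."""
    fields = read_name.split(":")
    if len(fields) == 5:
        tile, x, y = fields[2:5]
    elif len(fields) == 7:  # CASAVA 1.8 style names
        tile, x, y = fields[4:7]
    else:
        raise ValueError(f"Unexpected number of fields in read name: {read_name!r}")
    return int(tile), int(x), int(y)

def is_optical_duplicate(name_a: str, name_b: str, pixel_distance: int = 100) -> bool:
    """True if two duplicate reads sit on the same tile within pixel_distance on both axes."""
    tile_a, x_a, y_a = parse_location(name_a)
    tile_b, x_b, y_b = parse_location(name_b)
    return (tile_a == tile_b
            and abs(x_a - x_b) <= pixel_distance
            and abs(y_a - y_b) <= pixel_distance)

print(round(expected_group_size(10_000_000, 5), 2))  # 10,000,000 / 4**10
```

The read names passed to parse_location in practice would come from the input file; the sketch does not consider lanes or read groups.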

FastqToSam

Converts a FASTQ file to an unaligned BAM or SAM file. This tool extracts read sequences and base qualities from the input FASTQ file and writes them out to a new file in unaligned BAM (uBAM) format. Read group information can be provided on the command line.

Three versions of FASTQ quality scales are supported: FastqSanger, FastqSolexa and FastqIllumina (see http://maq.sourceforge.net/fastq.shtml for details). Input FASTQ files can be in GZip format (with .gz extension).

Usage example:

java -jar picard.jar FastqToSam \
F1=file_1.fastq \
O=fastq_to_bam.bam \
SM=for_tool_testing

Option: Description
FASTQ (File): Input fastq file (optionally gzipped) for single end data, or first read in paired end data. Required.
FASTQ2 (File): Input fastq file (optionally gzipped) for the second read of paired end data. Default value: null.
USE_SEQUENTIAL_FASTQS (Boolean): Use sequential fastq files with the suffix _###.fastq or _###.fastq.gz. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
QUALITY_FORMAT (FastqQualityFormat): A value describing how the quality values are encoded in the input FASTQ file. Either Solexa (phred scaling + 66), Illumina (phred scaling + 64) or Standard (phred scaling + 33). If this value is not specified, the quality format will be detected automatically. Default value: null. Possible values: {Solexa, Illumina, Standard}
OUTPUT (File): Output SAM/BAM file. Required.
READ_GROUP_NAME (String): Read group name. Default value: A. This option can be set to 'null' to clear the default value.
SAMPLE_NAME (String): Sample name to insert into the read group header. Required.
LIBRARY_NAME (String): The library name to place into the LB attribute in the read group header. Default value: null.
PLATFORM_UNIT (String): The platform unit (often run_barcode.lane) to insert into the read group header. Default value: null.
PLATFORM (String): The platform type (e.g. illumina, solid) to insert into the read group header. Default value: null.
SEQUENCING_CENTER (String): The sequencing center from which the data originated. Default value: null.
PREDICTED_INSERT_SIZE (Integer): Predicted median insert size, to insert into the read group header. Default value: null.
PROGRAM_GROUP (String): Program group to insert into the read group header. Default value: null.
PLATFORM_MODEL (String): Platform model to insert into the group header (free-form text providing further details of the platform/technology used). Default value: null.
COMMENT (String): Comment(s) to include in the merged output file's header. Default value: null. This option may be specified 0 or more times.
DESCRIPTION (String): Inserted into the read group header. Default value: null.
RUN_DATE (Iso8601Date): Date the run was produced, to insert into the read group header. Default value: null.
SORT_ORDER (SortOrder): The sort order for the output sam/bam file. Default value: queryname. This option can be set to 'null' to clear the default value. Possible values: {unsorted, queryname, coordinate, duplicate, unknown}
MIN_Q (Integer): Minimum quality allowed in the input fastq. An exception will be thrown if a quality is less than this value. Default value: 0. This option can be set to 'null' to clear the default value.
MAX_Q (Integer): Maximum quality allowed in the input fastq. An exception will be thrown if a quality is greater than this value. Default value: 93. This option can be set to 'null' to clear the default value.
STRIP_UNPAIRED_MATE_NUMBER (Boolean): Deprecated (no longer used). If true and this is an unpaired fastq, any occurrence of '/1' or '/2' will be removed from the end of a read name. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ALLOW_AND_IGNORE_EMPTY_LINES (Boolean): Allow (and ignore) empty lines. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
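
To make the QUALITY_FORMAT, MIN_Q and MAX_Q options more concrete, here is a short Python sketch that decodes a FASTQ quality string for the two Phred-scaled encodings and applies the range check. It is illustrative only: the function name decode_qualities is hypothetical, and the Solexa format is left out because it uses a different quality scale rather than just a different offset.

```python
# ASCII offsets for the two Phred-scaled encodings named in QUALITY_FORMAT.
OFFSETS = {"Standard": 33, "Illumina": 64}

def decode_qualities(quality_string: str, quality_format: str = "Standard",
                     min_q: int = 0, max_q: int = 93):
    """Convert a FASTQ quality string to integer scores and enforce the MIN_Q..MAX_Q range."""
    offset = OFFSETS[quality_format]
    scores = [ord(character) - offset for character in quality_string]
    for score in scores:
        if score < min_q or score > max_q:
            raise ValueError(f"Quality {score} is outside the range {min_q}..{max_q}")
    return scores

print(decode_qualities("II5!"))              # [40, 40, 20, 0]
print(decode_qualities("hhT@", "Illumina"))  # [40, 40, 20, 0]
```

Decoding a Standard-encoded string with the Illumina offset produces negative scores, which is the kind of input error the MIN_Q check is described as catching.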

FifoBuffer

Provides a large, configurable, FIFO buffer that can be used to buffer input and output streams between programs with a buffer size that is larger than that offered by native unix FIFOs (usually 64k).

Option: Description
BUFFER_SIZE (Integer): The size of the memory buffer in bytes. Default value: 536870912 (512 MiB). This option can be set to 'null' to clear the default value.
IO_SIZE (Integer): The size, in bytes, to read/write atomically to the input and output streams. Default value: 65536 (64 KiB). This option can be set to 'null' to clear the default value.
DEBUG_FREQUENCY (Integer): How frequently, in seconds, to report debugging statistics. Set to zero for never. Default value: 0. This option can be set to 'null' to clear the default value.
NAME (String): Name to use for Fifo in debugging statements. Default value: null.

FindMendelianViolations

Finds mendelian violations of all types within a VCF. Takes in a VCF or BCF and a pedigree file and looks for high confidence calls where the genotype of the offspring is incompatible with the genotypes of the parents. Assumes the existence of the format fields AD, DP, GT, GQ, and PL. Take note that the implementation assumes that reads from the pseudo-autosomal region (PAR) will be mapped to the female chromosome rather than the male. This requires that the PAR in the male chromosome be masked so that the aligner has a single contig to map to. This is normally done for the public releases of the human reference.

Usage example:

java -jar picard.jar FindMendelianViolations \
I=input.vcf \
TRIOS=family.ped \
OUTPUT=mendelian.txt \
MIN_DP=20
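
As a rough illustration of what "incompatible with the genotypes of the parents" means for a diploid autosomal site, the following Python sketch checks whether a child's genotype can be formed from one maternal and one paternal allele, after a simple confidence screen. It is not the tool's implementation: sex chromosomes, the PAR, the non-ref likelihood alternative to GQ, and allele balance are ignored, and the function names and dictionary layout are hypothetical.

```python
def is_mendelian_violation(child, mother, father) -> bool:
    """Genotypes are 2-tuples of alleles, e.g. ("A", "G"); autosomal diploid sites only."""
    first, second = child
    compatible = ((first in mother and second in father)
                  or (second in mother and first in father))
    return not compatible

def is_confident(sample, min_gq=30, min_dp=0) -> bool:
    """Simplified screen using the GQ and DP format fields."""
    return sample["GQ"] >= min_gq and sample["DP"] >= min_dp

def check_site(trio, min_gq=30, min_dp=0) -> bool:
    """trio maps 'child', 'mother' and 'father' to dicts with GT, GQ and DP entries."""
    if not all(is_confident(sample, min_gq, min_dp) for sample in trio.values()):
        return False  # low-confidence calls are not tested
    return is_mendelian_violation(trio["child"]["GT"], trio["mother"]["GT"], trio["father"]["GT"])

trio = {
    "child": {"GT": ("A", "A"), "GQ": 60, "DP": 35},
    "mother": {"GT": ("A", "G"), "GQ": 55, "DP": 30},
    "father": {"GT": ("G", "G"), "GQ": 70, "DP": 28},
}
print(check_site(trio, min_dp=20))  # True: the child has no allele that the father could have passed on
```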

Option: Description
INPUT (File): Input VCF or BCF with genotypes. Required.
TRIOS (File): File of Trio information in PED format (with no genotype columns). Required.
OUTPUT (File): Output metrics file. Required.
MIN_GQ (Integer): Minimum genotyping quality (or non-ref likelihood) to perform tests. Default value: 30. This option can be set to 'null' to clear the default value.
MIN_DP (Integer): Minimum depth in each sample to consider possible mendelian violations. Default value: 0. This option can be set to 'null' to clear the default value.
MIN_HET_FRACTION (Double): Minimum allele balance at sites that are heterozygous in the offspring. Default value: 0.3. This option can be set to 'null' to clear the default value.
VCF_DIR (File): If provided, output per-family VCFs of mendelian violations into this directory. Default value: null.
SKIP_CHROMS (String): List of chromosome names to skip entirely. Default value: [MT, chrM]. This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
MALE_CHROMS (String): List of possible names for male sex chromosome(s). Default value: [chrY, Y]. This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
FEMALE_CHROMS (String): List of possible names for female sex chromosome(s). Default value: [chrX, X]. This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
PSEUDO_AUTOSOMAL_REGIONS (String): List of chr:start-end for pseudo-autosomal regions on the female sex chromosome. Defaults to HG19/b37 & HG38 coordinates. Default value: [chrX:10000-2781479, X:10001-2649520, chrX:155701382-156030895, X:59034050-59373566]. This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
THREAD_COUNT (Integer): The number of threads that will be used to collect the metrics. Default value: 1. This option can be set to 'null' to clear the default value.
TAB_MODE (Boolean): If true then fields need to be delimited by a single tab. If false the delimiter is one or more whitespace characters. Note that tab mode does not strictly follow the PED spec. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

CrosscheckFingerprints

Checks if all fingerprints within a set of files appear to come from the same individual. The fingerprints are calculated initially at the readgroup level (if present) but can be "rolled-up" by library, sample or file, to increase power and provide results at the desired resolution. Regular output is in a "Moltenized" format, one row per comparison. In this format the output will include the LOD score and also tumor-aware LOD score which can help assess identity even in the presence of a severe LOH sample with high purity. A matrix output is also available to facilitate visual inspection of crosscheck results. A separate CLP, ClusterCrosscheckMetrics, can cluster the results as a connected graph according to LOD greater than a threshold.

Option: Description
INPUT (File): One or more input files (or lists of files) to compare fingerprints for. Default value: null. This option may be specified 0 or more times.
OUTPUT (File): Optional output file to write metrics to. Default is to write to stdout. Default value: null.
MATRIX_OUTPUT (File): Optional output file to write matrix of LOD scores to. This is less informative than the metrics output and only contains the Normal-Normal LOD score (i.e. it doesn't account for loss of heterozygosity). It is however sometimes easier to use visually. Default value: null.
HAPLOTYPE_MAP (File): The file lists a set of SNPs, optionally arranged in high-LD blocks, to be used for fingerprinting. See https://software.broadinstitute.org/gatk/documentation/article?id=9526 for details. Required.
LOD_THRESHOLD (Double): If any two groups (with the same sample name) match with a LOD score lower than the threshold the program will exit with a non-zero code to indicate error. Program will also exit with an error if it finds two groups with different sample names that match with a LOD score greater than -LOD_THRESHOLD. LOD score 0 means equal likelihood that the groups match vs. come from different individuals, negative LOD scores mean N logs more likely that the groups are from different individuals, and positive numbers mean N logs more likely that the groups are from the same individual. Default value: 0.0. This option can be set to 'null' to clear the default value.
CROSSCHECK_BY (DataType): Specifies which data-type should be used as the basic comparison unit. Fingerprints from readgroups can be "rolled-up" to the LIBRARY, SAMPLE, or FILE level before being compared. Fingerprints from VCF can be compared by SAMPLE or FILE. Default value: READGROUP. This option can be set to 'null' to clear the default value. Possible values: {FILE, SAMPLE, LIBRARY, READGROUP}
NUM_THREADS (Integer): The number of threads to use to process files and generate Fingerprints. Default value: 1. This option can be set to 'null' to clear the default value.
ALLOW_DUPLICATE_READS (Boolean): Allow the use of duplicate reads in performing the comparison. Can be useful when duplicate marking has been overly aggressive and coverage is low. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
GENOTYPING_ERROR_RATE (Double): Assumed genotyping error rate that provides a floor on the probability that a genotype comes from the expected sample. Default value: 0.01. This option can be set to 'null' to clear the default value.
OUTPUT_ERRORS_ONLY (Boolean): If true then only groups that do not relate to each other as expected will have their LODs reported. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
LOSS_OF_HET_RATE (Double): The rate at which a heterozygous genotype in a normal sample turns into a homozygous one (via loss of heterozygosity) in the tumor (the model assumes independent events, so this needs to be larger than reality). Default value: 0.5. This option can be set to 'null' to clear the default value.
EXPECT_ALL_GROUPS_TO_MATCH (Boolean): Expect all groups' fingerprints to match, irrespective of their sample names. By default (with this value set to false), groups (readgroups, libraries, files, or samples) with different sample names are expected to mismatch, and those with the same sample name are expected to match. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
EXIT_CODE_WHEN_MISMATCH (Integer): When one or more mismatches between groups is detected, exit with this value instead of 0. Default value: 1. This option can be set to 'null' to clear the default value.

ClusterCrosscheckMetrics

Clusters the results from a CrosscheckFingerprints run into groups that are connected according to a large enough LOD score.

Option: Description
INPUT (File): The cross-check metrics file to be clustered. Required.
OUTPUT (File): Optional output file to write metrics to. Default is to write to stdout. Default value: null.
LOD_THRESHOLD (Double): LOD score to be used as the threshold for clustering. Default value: 0.0. This option can be set to 'null' to clear the default value.
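
Clustering groups that are "connected according to a large enough LOD score" amounts to finding connected components in a graph whose edges are the comparisons with a LOD above the threshold. The Python sketch below does this with a small union-find structure. The function name and the tuple layout are hypothetical, and this illustrates the idea rather than the tool's code.

```python
def cluster_groups(comparisons, lod_threshold=0.0):
    """comparisons: iterable of (left_group, right_group, lod). Returns sorted clusters."""
    parent = {}

    def find(group):
        parent.setdefault(group, group)
        while parent[group] != group:
            parent[group] = parent[parent[group]]  # path halving keeps the trees shallow
            group = parent[group]
        return group

    for left, right, lod in comparisons:
        root_left, root_right = find(left), find(right)
        # Only comparisons above the threshold connect two groups.
        if lod > lod_threshold and root_left != root_right:
            parent[root_left] = root_right

    clusters = {}
    for group in list(parent):
        clusters.setdefault(find(group), set()).add(group)
    return sorted(sorted(members) for members in clusters.values())

comparisons = [
    ("rg1", "rg2", 12.5),
    ("rg2", "rg3", 8.0),
    ("rg1", "rg4", -20.0),
    ("rg4", "rg5", 3.0),
]
print(cluster_groups(comparisons))  # [['rg1', 'rg2', 'rg3'], ['rg4', 'rg5']]
```

Note that rg1 and rg3 end up in the same cluster even though they were never compared directly, because both are connected to rg2.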

CheckFingerprint

Computes a fingerprint from the supplied input file (SAM/BAM or VCF) and compares it to the expected fingerprint genotypes provided. The key output is a LOD score which represents the relative likelihood of the sequence data originating from the same sample as the genotypes vs. from a random sample. Two outputs are produced: (1) a summary metrics file that gives metrics at the single sample level (if the input was a VCF) or at the read level (lane or index within a lane) (if the input was a SAM/BAM) versus a set of known genotypes for the expected sample, and (2) a detail metrics file that contains an individual SNP/Haplotype comparison within a fingerprint comparison. The two files may be specified individually using the SUMMARY_OUTPUT and DETAIL_OUTPUT options. Alternatively, the OUTPUT option may be used to give the base of the two output files, with the summary metrics having the file extension 'fingerprinting_summary_metrics' and the detail metrics having the file extension 'fingerprinting_detail_metrics'.

Option: Description
INPUT (File): Input file SAM/BAM or VCF. If a VCF is used, it must have at least one sample. If there is more than one sample in the VCF, the parameter OBSERVED_SAMPLE_ALIAS must be provided in order to indicate which sample's data to use. If there are no samples in the VCF, an exception will be thrown. Required.
OBSERVED_SAMPLE_ALIAS (String): If the input is a VCF, this parameter is used to select which sample's data in the VCF to use. Default value: null.
OUTPUT (String): The base prefix of output files to write. The summary metrics will have the file extension 'fingerprinting_summary_metrics' and the detail metrics will have the extension 'fingerprinting_detail_metrics'. Required. Cannot be used in conjunction with option(s) SUMMARY_OUTPUT (S) DETAIL_OUTPUT (D).
SUMMARY_OUTPUT (File): The text file to which to write summary metrics. Required. Cannot be used in conjunction with option(s) OUTPUT (O).
DETAIL_OUTPUT (File): The text file to which to write detail metrics. Required. Cannot be used in conjunction with option(s) OUTPUT (O).
GENOTYPES (File): File of genotypes (VCF or GELI) to be used in comparison. May contain any number of genotypes; CheckFingerprint will use only those that are usable for fingerprinting. Required.
EXPECTED_SAMPLE_ALIAS (String): This parameter can be used to specify which sample's genotypes to use from the expected VCF file (the GENOTYPES file). If it is not supplied, the sample name from the input (VCF or BAM read group header) will be used. Default value: null.
HAPLOTYPE_MAP (File): The file lists a set of SNPs, optionally arranged in high-LD blocks, to be used for fingerprinting. See https://software.broadinstitute.org/gatk/documentation/article?id=9526 for details. Required.
GENOTYPE_LOD_THRESHOLD (Double): When counting haplotypes checked and matching, count only haplotypes where the most likely haplotype achieves at least this LOD. Default value: 5.0. This option can be set to 'null' to clear the default value.
IGNORE_READ_GROUPS (Boolean): If the input is a SAM/BAM, and this parameter is true, treat the entire input BAM as one single read group in the calculation, ignoring RG annotations, and producing a single fingerprint metric for the entire BAM. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

FilterSamReads

Subset read data from a SAM or BAM file. This tool takes a SAM or BAM file and subsets it to a new file that either excludes or only includes either aligned or unaligned reads (set using FILTER), or specific reads based on a list of read names supplied in the READ_LIST_FILE.

Usage example:

java -jar picard.jar FilterSamReads \
I=input.bam \
O=output.bam \
READ_LIST_FILE=read_names.txt \
FILTER=filter_value
For information on the SAM format, please see: http://samtools.sourceforge.net

Option | Description
INPUT (File) | The SAM or BAM file that will be filtered. Required.
FILTER (Filter) | Filter. Required. Possible values: {includeAligned [OUTPUT SAM/BAM will contain aligned reads only. INPUT SAM/BAM must be in queryname SortOrder. (Note that *both* first and second of paired reads must be aligned to be included in the OUTPUT SAM or BAM)], excludeAligned [OUTPUT SAM/BAM will contain unmapped reads only. INPUT SAM/BAM must be in queryname SortOrder. (Note that *both* first and second of pair must be aligned to be excluded from the OUTPUT SAM or BAM)], includeReadList [OUTPUT SAM/BAM will contain reads that are supplied in the READ_LIST_FILE file], excludeReadList [OUTPUT SAM/BAM will contain reads that are *not* supplied in the READ_LIST_FILE file], includeJavascript [OUTPUT SAM/BAM will contain reads that have been accepted by the JAVASCRIPT_FILE script], includePairedIntervals [OUTPUT SAM/BAM will contain any reads (and their mate) that overlap with an interval. INPUT SAM/BAM and INTERVAL_LIST must be in coordinate SortOrder. Only aligned reads will be output.]}
READ_LIST_FILE (File) | Read List File containing reads that will be included or excluded from the OUTPUT SAM or BAM file. Default value: null.
INTERVAL_LIST (File) | Interval List File containing intervals that will be included or excluded from the OUTPUT SAM or BAM file. Default value: null.
SORT_ORDER (SortOrder) | SortOrder of the OUTPUT SAM or BAM file, otherwise use the SortOrder of the INPUT file. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate, unknown}
WRITE_READS_FILES (Boolean) | Create .reads files (for debugging purposes). Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
OUTPUT (File) | SAM or BAM file to write the filtered reads to. Required.
JAVASCRIPT_FILE (File) | Filters a SAM or BAM file with a JavaScript expression using the Java JavaScript engine. The script puts the following variables in the script context: 'record' a SAMRecord ( https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/SAMRecord.html ) and 'header' a SAMFileHeader ( https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/SAMFileHeader.html ). The last value of the script should be a boolean that tells whether to accept or reject the record. Default value: null.
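
As a rough illustration of the includeJavascript filter, the shell snippet below writes a one-expression script file and shows, as a comment, how it might be passed to the tool. The file name mapq_filter.js and the threshold of 30 are assumptions made for this sketch; getReadUnmappedFlag() and getMappingQuality() are SAMRecord methods listed in the htsjdk javadoc linked above.

```shell
# Write a hypothetical filter script; its last expression is the boolean result.
cat > mapq_filter.js <<'EOF'
// Accept mapped reads with a mapping quality of at least 30 (illustrative threshold).
!record.getReadUnmappedFlag() && record.getMappingQuality() >= 30;
EOF

# The script would then be supplied to the tool like this (not executed here):
# java -jar picard.jar FilterSamReads \
#   I=input.bam \
#   O=output.bam \
#   FILTER=includeJavascript \
#   JAVASCRIPT_FILE=mapq_filter.js
```

The quoted heredoc delimiter ('EOF') keeps the shell from expanding anything inside the script, so the JavaScript is written to disk exactly as typed.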

FilterVcf

Applies one or more hard filters to a VCF file to filter out genotypes and variants.

Option | Description
INPUT (File) | The INPUT VCF or BCF file. Required.
OUTPUT (File) | The output VCF or BCF. Required.
MIN_AB (Double) | The minimum allele balance acceptable before filtering a site. Allele balance is calculated for heterozygotes as the number of bases supporting the least-represented allele over the total number of base observations. Different heterozygote genotypes at the same locus are measured independently. The locus is filtered if any allele balance is below the limit. Default value: 0.0. This option can be set to 'null' to clear the default value.
MIN_DP (Integer) | The minimum sequencing depth supporting a genotype before the genotype will be filtered out. Default value: 0. This option can be set to 'null' to clear the default value.
MIN_GQ (Integer) | The minimum genotype quality that must be achieved for a sample, otherwise the genotype will be filtered out. Default value: 0. This option can be set to 'null' to clear the default value.
MAX_FS (Double) | The maximum Phred-scaled Fisher strand value before a site will be filtered out. Default value: 1.7976931348623157E308. This option can be set to 'null' to clear the default value.
MIN_QD (Double) | The minimum QD value to accept or otherwise filter out the variant. Default value: 0.0. This option can be set to 'null' to clear the default value.
JAVASCRIPT_FILE (File) | Filters a VCF file with a JavaScript expression interpreted by the Java JavaScript engine. The script puts the following variables in the script context: 'variant' a VariantContext ( https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/VariantContext.html ) and 'header' a VCFHeader ( https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/vcf/VCFHeader.html ). The last value of the script should be a boolean that tells whether to accept or reject the record. Default value: null.
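
The allele balance arithmetic described for MIN_AB can be worked through with a small shell sketch. The base counts (12 and 28) and the threshold of 0.2 are made-up values, and the snippet only mirrors the definition given in the table; it is not Picard's implementation.

```shell
# Hypothetical base counts for the two alleles of one heterozygous genotype.
ref_count=12
alt_count=28
min_ab=0.2   # illustrative MIN_AB threshold

# Allele balance = count of the least-represented allele / total observations.
ab=$(awk -v a="$ref_count" -v b="$alt_count" 'BEGIN {
  min = (a < b) ? a : b
  printf "%.2f", min / (a + b)
}')

# Compare against the threshold to decide whether the site would be filtered.
verdict=$(awk -v ab="$ab" -v t="$min_ab" 'BEGIN { print ((ab < t) ? "filtered" : "kept") }')

echo "allele balance = $ab ($verdict)"
```

With these numbers the balance is 12 / 40 = 0.30, which is above the assumed limit, so the site would be kept.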

FixMateInformation

Verify mate-pair information between mates and fix if needed. This tool ensures that all mate-pair information is in sync between each read and its mate pair. If no OUTPUT file is supplied, the output is written to a temporary file and then copied over the INPUT file. Reads marked with the secondary alignment flag are written to the output file unchanged.

Usage example:

java -jar picard.jar FixMateInformation \
I=input.bam \
O=fixed_mate.bam

Option | Description
INPUT (File) | The input file to check and fix. Default value: null. This option may be specified 0 or more times.
OUTPUT (File) | The output file to write to. If no output file is supplied, the input file is overwritten. Default value: null.
SORT_ORDER (SortOrder) | Optional sort order if the OUTPUT file should be sorted differently than the INPUT file. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate, unknown}
ASSUME_SORTED (Boolean) | If true, assume that the input file is queryname sorted, even if the header says otherwise. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ADD_MATE_CIGAR (Boolean) | Adds the mate CIGAR tag (MC) if true, does not if false. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
IGNORE_MISSING_MATES (Boolean) | If true, ignore missing mates, otherwise will throw an exception when missing mates are found. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}

GatherBamFiles

Concatenate one or more BAM files as efficiently as possible. This tool performs a rapid "gather" operation on BAM files after scatter operations, where the same process has been performed on different regions of a BAM file, creating many smaller BAM files that now need to be concatenated (reassembled) back together.

The tool assumes that the BAM files provided as INPUT are listed in the order in which they should be concatenated, and simply concatenates the bodies of the BAM files while retaining the header from the first file. It copies the gzip blocks directly for speed, but also supports generating an MD5 of the output and indexing the output BAM file. Only BAM files are supported; SAM files are not.

Usage example:

java -jar picard.jar GatherBamFiles \
I=input1.bam \
I=input2.bam \
O=gathered_files.bam

Option | Description
INPUT (File) | Two or more BAM files or text files containing lists of BAM files (one per line). Default value: null. This option may be specified 0 or more times.
OUTPUT (File) | The output BAM file to write. Required.

GatherVcfs

Gathers multiple VCF files from a scatter operation into a single VCF file. Input files must be supplied in genomic order and must not have events at overlapping positions.

Option | Description
INPUT (File) | Input VCF file(s). Default value: null. This option may be specified 0 or more times.
OUTPUT (File) | Output VCF file. Required.

GenotypeConcordance

Evaluate genotype concordance between callsets. This tool evaluates the concordance between genotype calls for samples in different callsets, where one is considered the truth (aka standard, or reference) and the other is the call that is being evaluated for accuracy.

Usage example:

java -jar picard.jar GenotypeConcordance \
CALL_VCF=input.vcf \
CALL_SAMPLE=sample_name \
O=gc_concordance.vcf \
TRUTH_VCF=truth_set.vcf \
TRUTH_SAMPLE=truth_sample#

Output Metrics:

  • Output metrics include GenotypeConcordanceContingencyMetrics, GenotypeConcordanceSummaryMetrics, and GenotypeConcordanceDetailMetrics. For each set of metrics, the data is broken into separate sections for SNPs and INDELs. Note that only SNP and INDEL variants are considered; MNP, Symbolic, and Mixed classes of variants are not included.
  • GenotypeConcordanceContingencyMetrics enumerate the constituents of each contingent in a callset including true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) calls. See http://broadinstitute.github.io/picard/picard-metric-definitions.html#GenotypeConcordanceContingencyMetrics for more details.
  • GenotypeConcordanceDetailMetrics include the numbers of SNPs and INDELs for each contingent genotype as well as the number of validated genotypes. See http://broadinstitute.github.io/picard/picard-metric-definitions.html#GenotypeConcordanceDetailMetrics for more details.
  • GenotypeConcordanceSummaryMetrics provide specific details for the variant caller performance on a callset including: values for sensitivity, specificity, and positive predictive values. See http://broadinstitute.github.io/picard/picard-metric-definitions.html#GenotypeConcordanceSummaryMetrics for more details.


Useful definitions applicable to alleles and genotypes:
  • Truthset - A callset (typically in VCF format) containing variant calls and genotypes that have been cross-validated with multiple technologies e.g. Genome In A Bottle Consortium (GIAB) (https://sites.stanford.edu/abms/giab)
  • TP - True positives are variant calls that match a 'truthset'
  • FP - False-positives are reference sites miscalled as variant
  • FN - False-negatives are variant sites miscalled as reference
  • TN - True negatives are correctly called reference sites
  • Validated genotypes - TP sites where the exact genotype (HET or HOM-VAR) has been validated
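
To make the TP/FP/FN/TN definitions above concrete, the following shell sketch derives sensitivity, positive predictive value, and specificity from a set of made-up counts. It uses the textbook formulas; the exact fields that Picard reports, which are broken down further by genotype class, are described on the metric definitions pages linked above and may differ in detail.

```shell
# Hypothetical contingency counts for one variant type (not real data).
tp=90; fp=10; fn=5; tn=895

metrics=$(awk -v tp="$tp" -v fp="$fp" -v fn="$fn" -v tn="$tn" 'BEGIN {
  printf "sensitivity=%.4f\n", tp / (tp + fn)   # fraction of true variants that were called
  printf "ppv=%.4f\n",         tp / (tp + fp)   # fraction of variant calls that are true
  printf "specificity=%.4f\n", tn / (tn + fp)   # fraction of reference sites called as reference
}')

echo "$metrics"
```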

VCF Output:

  • The concordance state will be stored in the "CONC_ST" tag in the INFO field.
  • The truth sample name will be "truth" and call sample name will be "call".

Option | Description
TRUTH_VCF (File) | The VCF containing the truth sample. Required.
CALL_VCF (File) | The VCF containing the call sample. Required.
OUTPUT (File) | Basename for the two metrics files that are to be written. Resulting files will be .genotype_concordance_summary_metrics and .genotype_concordance_detail_metrics. Required.
OUTPUT_VCF (Boolean) | Output a VCF annotated with concordance information. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
TRUTH_SAMPLE (String) | The name of the truth sample within the truth VCF. Not required if only one sample exists. Default value: null.
CALL_SAMPLE (String) | The name of the call sample within the call VCF. Not required if only one sample exists. Default value: null.
INTERVALS (File) | One or more interval list files that will be used to limit the genotype concordance. Note: if intervals are specified, the VCF files must be indexed. Default value: null. This option may be specified 0 or more times.
INTERSECT_INTERVALS (Boolean) | If true, multiple interval lists will be intersected. If false, multiple lists will be unioned. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MIN_GQ (Integer) | Genotypes below this genotype quality will have genotypes classified as LowGq. Default value: 0. This option can be set to 'null' to clear the default value.
MIN_DP (Integer) | Genotypes below this depth will have genotypes classified as LowDp. Default value: 0. This option can be set to 'null' to clear the default value.
OUTPUT_ALL_ROWS (Boolean) | If true, output all rows in detailed statistics even when count == 0. When false, only output rows with non-zero counts. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
USE_VCF_INDEX (Boolean) | If true, use the VCF index, else iterate over the entire VCF. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MISSING_SITES_HOM_REF (Boolean) | Default is false, which follows the GA4GH Scheme. If true, missing sites in the truth set will be treated as HOM_REF sites and sites missing in both the truth and call sets will be true negatives. Useful when hom ref sites are left out of the truth set. This flag can only be used with a high confidence interval list. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

IlluminaBasecallsToFastq

Generate FASTQ file(s) from Illumina basecall read data.

This tool generates FASTQ files from data in an Illumina BaseCalls output directory. Separate FASTQ files are created for each template, barcode, and index (molecular barcode) read. Briefly, the template reads are the target sequence of your experiment, the barcode sequence reads facilitate sample demultiplexing, and the index reads help mitigate instrument phasing errors. For additional information on the read types, please see the following reference here.

In the absence of sample pooling (multiplexing) and/or barcodes, an OUTPUT_PREFIX (file directory) must be provided as the sample identifier. For multiplexed samples, a MULTIPLEX_PARAMS file must be specified. The MULTIPLEX_PARAMS file contains the list of sample barcodes used to sort template, barcode, and index reads. It is essentially the same as the BARCODE_FILE used in the ExtractIlluminaBarcodes tool.

Files from this tool use the following naming format: {prefix}.{type}_{number}.fastq with the {prefix} indicating the sample barcode, the {type} indicating the types of reads e.g. index, barcode, or blank (if it contains a template read). The {number} indicates the read number, either first (1) or second (2) for paired-end sequencing.

Usage examples:

Example 1: Sample(s) with either no barcode or barcoded without multiplexing 
java -jar picard.jar IlluminaBasecallsToFastq \
READ_STRUCTURE=25T8B25T \
BASECALLS_DIR=basecallDirectory \
LANE=001 \
OUTPUT_PREFIX=noBarcode.1 \
RUN_BARCODE=run15 \
FLOWCELL_BARCODE=abcdeACXX

Example 2: Multiplexed samples
java -jar picard.jar IlluminaBasecallsToFastq \
READ_STRUCTURE=25T8B25T \
BASECALLS_DIR=basecallDirectory \
LANE=001 \
MULTIPLEX_PARAMS=demultiplexed_output.txt \
RUN_BARCODE=run15 \
FLOWCELL_BARCODE=abcdeACXX

The FLOWCELL_BARCODE is required if emitting Casava 1.8-style read name headers.
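
For the multiplexed case, a MULTIPLEX_PARAMS file could be prepared along the following lines. This is only a sketch: the file name, sample prefixes, and barcodes are placeholders, and it assumes a single barcode per cluster and a header row naming the OUTPUT_PREFIX and BARCODE_1 columns.

```shell
# Build a hypothetical tab-separated MULTIPLEX_PARAMS file for a single-barcode run.
# Sample names and barcodes are placeholders.
{
  printf 'OUTPUT_PREFIX\tBARCODE_1\n'
  printf 'sampleA\tAAAAAAAA\n'
  printf 'sampleB\tAAAAGAAG\n'
  printf 'unmatched\tN\n'          # row used when no barcode matches
} > multiplex_params.txt

# Sanity check: every line should have exactly two tab-separated fields.
bad_lines=$(awk -F '\t' 'NF != 2' multiplex_params.txt | wc -l | tr -d ' ')
echo "lines with unexpected field count: $bad_lines"
```

Writing the tabs with printf avoids the common problem of an editor silently replacing them with spaces.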


Option | Description
BASECALLS_DIR (File) | The basecalls directory. Required.
BARCODES_DIR (File) | The barcodes directory with _barcode.txt files (generated by ExtractIlluminaBarcodes). If not set, use BASECALLS_DIR. Default value: null.
LANE (Integer) | Lane number. Required.
OUTPUT_PREFIX (File) | The prefix for output FASTQs. Extensions as described above are appended. Use this option for a non-barcoded run, or for a barcoded run in which it is not desired to demultiplex reads into separate files by barcode. Required. Cannot be used in conjunction with option(s) MULTIPLEX_PARAMS.
RUN_BARCODE (String) | The barcode of the run. Prefixed to read names. Required.
MACHINE_NAME (String) | The name of the machine on which the run was sequenced; required if emitting Casava1.8-style read name headers. Default value: null.
FLOWCELL_BARCODE (String) | The barcode of the flowcell that was sequenced; required if emitting Casava1.8-style read name headers. Default value: null.
READ_STRUCTURE (String) | A description of the logical structure of clusters in an Illumina Run, i.e. a description of the structure IlluminaBasecallsToSam assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for Sample Barcode, M for molecular barcode, T for Template, and S for skip). E.g. if the input data consists of 80 base clusters and we provide a read structure of "28T8M8B8S28T", then the sequence may be split up into four reads: read one with 28 cycles (bases) of template; read two with 8 cycles (bases) of molecular barcode (e.g. unique molecular barcode); read three with 8 cycles (bases) of sample barcode; 8 cycles (bases) skipped; read four with 28 cycles (bases) of template. The skipped cycles would NOT be included in an output SAM/BAM file or in read groups therein. Required.
MULTIPLEX_PARAMS (File) | Tab-separated file for creating all output FASTQs demultiplexed by barcode for a lane with a single IlluminaBasecallsToFastq invocation. The columns are OUTPUT_PREFIX, and BARCODE_1, BARCODE_2 ... BARCODE_X where X = number of barcodes per cluster (optional). The row with BARCODE_1 set to 'N' is used to specify an output_prefix for no barcode match. Required. Cannot be used in conjunction with option(s) OUTPUT_PREFIX (O).
ADAPTERS_TO_CHECK (IlluminaAdapterPair) | Deprecated (No longer used). Which adapters to look for in the read. Default value: null. Possible values: {PAIRED_END, INDEXED, SINGLE_END, NEXTERA_V1, NEXTERA_V2, DUAL_INDEXED, FLUIDIGM, TRUSEQ_SMALLRNA, ALTERNATIVE_SINGLE_END} This option may be specified 0 or more times.
NUM_PROCESSORS (Integer) | The number of threads to run in parallel. If NUM_PROCESSORS = 0, the number of threads is automatically set to the number of cores available on the machine. If NUM_PROCESSORS < 0, the number of cores used will be the number available on the machine less the absolute value of NUM_PROCESSORS. Default value: 0. This option can be set to 'null' to clear the default value.
FIRST_TILE (Integer) | If set, this is the first tile to be processed (used for debugging). Note that tiles are not processed in numerical order. Default value: null.
TILE_LIMIT (Integer) | If set, process no more than this many tiles (used for debugging). Default value: null.
APPLY_EAMSS_FILTER (Boolean) | Apply EAMSS filtering to identify inappropriately quality scored bases towards the ends of reads and convert their quality scores to Q2. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
FORCE_GC (Boolean) | If true, call System.gc() periodically. This is useful in cases in which the -Xmx value passed is larger than the available memory. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MAX_READS_IN_RAM_PER_TILE (Integer) | Configure SortingCollections to store this many records before spilling to disk. For an indexed run, each SortingCollection gets this value/number of indices. Default value: 1200000. This option can be set to 'null' to clear the default value.
MINIMUM_QUALITY (Integer) | The minimum quality (after transforming 0s to 1s) expected from reads. If qualities are lower than this value, an error is thrown. The default of 2 is what Illumina's spec describes as the minimum, but in practice lower values have been observed. Default value: 2. This option can be set to 'null' to clear the default value.
INCLUDE_NON_PF_READS (Boolean) | Whether to include non-PF reads. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
IGNORE_UNEXPECTED_BARCODES (Boolean) | Whether to ignore reads whose barcodes are not found in MULTIPLEX_PARAMS. Useful when outputting FASTQs for only a subset of the barcodes in a lane. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
READ_NAME_FORMAT (ReadNameFormat) | The read name header formatting to emit. Casava1.8 formatting has additional information beyond Illumina, including: the passing-filter flag value for the read, the flowcell name, and the sequencer name. Default value: CASAVA_1_8. This option can be set to 'null' to clear the default value. Possible values: {CASAVA_1_8, ILLUMINA}
COMPRESS_OUTPUTS (Boolean) | Compress output FASTQ files using gzip and append a .gz extension to the file names. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
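
The READ_STRUCTURE notation from the table above can be unpacked with a short shell sketch. It splits the documented example "28T8M8B8S28T" into its count/type pairs and totals the cycles; it is an illustration of the notation only, not the parser Picard uses.

```shell
read_structure="28T8M8B8S28T"

# Walk the string one <count><type> pair at a time and total the cycles.
summary=$(printf '%s\n' "$read_structure" | awk '{
  s = $0; total = 0
  while (match(s, /^[0-9]+[TBMS]/)) {
    count = substr(s, 1, RLENGTH - 1) + 0   # leading digits
    kind  = substr(s, RLENGTH, 1)           # trailing type letter
    printf "%d cycles of type %s\n", count, kind
    total += count
    s = substr(s, RLENGTH + 1)              # drop the pair just consumed
  }
  if (s != "") { print "unparsed remainder: " s }
  printf "total cycles: %d\n", total
}')

echo "$summary"
```

For the example string the total is 80 cycles, matching the 80-base clusters mentioned in the option description.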

IlluminaBasecallsToSam

Transforms raw Illumina sequencing data into an unmapped SAM or BAM file.

The IlluminaBasecallsToSam program collects, demultiplexes, and sorts reads across all of the tiles of a lane via barcode to produce an unmapped SAM/BAM file. An unmapped BAM file is often referred to as a uBAM. All barcode, sample, and library data is provided in the LIBRARY_PARAMS file, which should be formatted according to the specifications given in the options below. The following is an example of a properly formatted LIBRARY_PARAMS file:

BARCODE_1 OUTPUT SAMPLE_ALIAS LIBRARY_NAME
AAAAAAAA SA_AAAAAAAA.bam SA_AAAAAAAA LN_AAAAAAAA
AAAAGAAG SA_AAAAGAAG.bam SA_AAAAGAAG LN_AAAAGAAG
AACAATGG SA_AACAATGG.bam SA_AACAATGG LN_AACAATGG
N SA_non_indexed.bam SA_non_indexed LN_NNNNNNNN

The BARCODES_DIR file is produced by the ExtractIlluminaBarcodes tool for each lane of a flow cell.

Usage example:

java -jar picard.jar IlluminaBasecallsToSam \
BASECALLS_DIR=/BaseCalls/ \
LANE=001 \
READ_STRUCTURE=25T8B25T \
RUN_BARCODE=run15 \
IGNORE_UNEXPECTED_BARCODES=true \
LIBRARY_PARAMS=library.params

Option | Description
BASECALLS_DIR (File) | The basecalls directory. Required.
BARCODES_DIR (File) | The barcodes directory with _barcode.txt files (generated by ExtractIlluminaBarcodes). If not set, use BASECALLS_DIR. Default value: null.
LANE (Integer) | Lane number. Required.
OUTPUT (File) | Deprecated (use LIBRARY_PARAMS). The output SAM or BAM file. Format is determined by extension. Required. Cannot be used in conjunction with option(s) LIBRARY_PARAMS BARCODE_PARAMS.
RUN_BARCODE (String) | The barcode of the run. Prefixed to read names. Required.
SAMPLE_ALIAS (String) | Deprecated (use LIBRARY_PARAMS). The name of the sequenced sample. Required. Cannot be used in conjunction with option(s) LIBRARY_PARAMS BARCODE_PARAMS.
READ_GROUP_ID (String) | ID used to link RG header record with RG tag in SAM record. If these are unique in SAM files that get merged, merge performance is better. If not specified, READ_GROUP_ID will be set to . . Default value: null.
LIBRARY_NAME (String) | Deprecated (use LIBRARY_PARAMS). The name of the sequenced library. Default value: null. Cannot be used in conjunction with option(s) LIBRARY_PARAMS BARCODE_PARAMS.
SEQUENCING_CENTER (String) | The name of the sequencing center that produced the reads. Used to set the RG.CN tag. Default value: BI. This option can be set to 'null' to clear the default value.
RUN_START_DATE (Date) | The start date of the run. Default value: null.
PLATFORM (String) | The name of the sequencing technology that produced the read. Default value: illumina. This option can be set to 'null' to clear the default value.
READ_STRUCTURE (String) | A description of the logical structure of clusters in an Illumina Run, i.e. a description of the structure IlluminaBasecallsToSam assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for Sample Barcode, M for molecular barcode, T for Template, and S for skip). E.g. if the input data consists of 80 base clusters and we provide a read structure of "28T8M8B8S28T", then the sequence may be split up into four reads: read one with 28 cycles (bases) of template; read two with 8 cycles (bases) of molecular barcode (e.g. unique molecular barcode); read three with 8 cycles (bases) of sample barcode; 8 cycles (bases) skipped; read four with 28 cycles (bases) of template. The skipped cycles would NOT be included in an output SAM/BAM file or in read groups therein. Required.
BARCODE_PARAMS (File) | Deprecated (use LIBRARY_PARAMS). Tab-separated file for creating all output BAMs for a barcoded run with a single IlluminaBasecallsToSam invocation. Columns are BARCODE, OUTPUT, SAMPLE_ALIAS, and LIBRARY_NAME. The row with BARCODE=N is used to specify a file for no barcode match. Required. Cannot be used in conjunction with option(s) LIBRARY_PARAMS SAMPLE_ALIAS (ALIAS) OUTPUT (O) LIBRARY_NAME (LIB).
LIBRARY_PARAMS (File) | Tab-separated file for creating all output BAMs for a lane with a single IlluminaBasecallsToSam invocation. The columns are OUTPUT, SAMPLE_ALIAS, and LIBRARY_NAME, BARCODE_1, BARCODE_2 ... BARCODE_X where X = number of barcodes per cluster (optional). The row with BARCODE_1 set to 'N' is used to specify a file for no barcode match. You may also provide any 2 letter RG header attributes (excluding PU, CN, PL, and DT) as columns in this file and the values for those columns will be inserted into the RG tag for the BAM file created for a given row. Required. Cannot be used in conjunction with option(s) SAMPLE_ALIAS (ALIAS) OUTPUT (O) LIBRARY_NAME (LIB) BARCODE_PARAMS.
ADAPTERS_TO_CHECK (IlluminaAdapterPair) | Which adapters to look for in the read. Default value: [INDEXED, DUAL_INDEXED, NEXTERA_V2, FLUIDIGM]. Possible values: {PAIRED_END, INDEXED, SINGLE_END, NEXTERA_V1, NEXTERA_V2, DUAL_INDEXED, FLUIDIGM, TRUSEQ_SMALLRNA, ALTERNATIVE_SINGLE_END} This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
FIVE_PRIME_ADAPTER (String) | For specifying adapters other than standard Illumina. Default value: null.
THREE_PRIME_ADAPTER (String) | For specifying adapters other than standard Illumina. Default value: null.
NUM_PROCESSORS (Integer) | The number of threads to run in parallel. If NUM_PROCESSORS = 0, the number of threads is automatically set to the number of cores available on the machine. If NUM_PROCESSORS < 0, the number of cores used will be the number available on the machine less the absolute value of NUM_PROCESSORS. Default value: 0. This option can be set to 'null' to clear the default value.
FIRST_TILE (Integer) | If set, this is the first tile to be processed (used for debugging). Note that tiles are not processed in numerical order. Default value: null.
TILE_LIMIT (Integer) | If set, process no more than this many tiles (used for debugging). Default value: null.
FORCE_GC (Boolean) | If true, call System.gc() periodically. This is useful in cases in which the -Xmx value passed is larger than the available memory. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
APPLY_EAMSS_FILTER (Boolean) | Apply EAMSS filtering to identify inappropriately quality scored bases towards the ends of reads and convert their quality scores to Q2. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MAX_READS_IN_RAM_PER_TILE (Integer) | Configure SortingCollections to store this many records before spilling to disk. For an indexed run, each SortingCollection gets this value/number of indices. Default value: 1200000. This option can be set to 'null' to clear the default value.
MINIMUM_QUALITY (Integer) | The minimum quality (after transforming 0s to 1s) expected from reads. If qualities are lower than this value, an error is thrown. The default of 2 is what Illumina's spec describes as the minimum, but in practice lower values have been observed. Default value: 2. This option can be set to 'null' to clear the default value.
INCLUDE_NON_PF_READS (Boolean) | Whether to include non-PF reads. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
IGNORE_UNEXPECTED_BARCODES (Boolean) | Whether to ignore reads whose barcodes are not found in LIBRARY_PARAMS. Useful when outputting BAMs for only a subset of the barcodes in a lane. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MOLECULAR_INDEX_TAG (String) | The tag to use to store any molecular indexes. If more than one molecular index is found, they will be concatenated and stored here. Default value: RX. This option can be set to 'null' to clear the default value.
MOLECULAR_INDEX_BASE_QUALITY_TAG (String) | The tag to use to store any molecular index base qualities. If more than one molecular index is found, their qualities will be concatenated and stored here (i.e. the number of "M" operators in the READ_STRUCTURE). Default value: QX. This option can be set to 'null' to clear the default value.
TAG_PER_MOLECULAR_INDEX (String) | The list of tags to store each molecular index. The number of tags should match the number of molecular indexes. Default value: null. This option may be specified 0 or more times.
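
The NUM_PROCESSORS rules in the table can be sketched as a small shell function. The function name resolve_threads is hypothetical, the core count is passed in explicitly to keep the example deterministic, and reading a negative value as "leave that many cores unused" is an interpretation of the option description rather than a statement about Picard's source.

```shell
# Resolve the effective thread count from a NUM_PROCESSORS-style setting.
# Usage: resolve_threads <requested> <available_cores>
resolve_threads() {
  requested=$1
  available_cores=$2
  if [ "$requested" -eq 0 ]; then
    echo "$available_cores"                      # 0: use every available core
  elif [ "$requested" -lt 0 ]; then
    echo $((available_cores + requested))        # negative: leave that many cores free
  else
    echo "$requested"                            # positive: use exactly this many threads
  fi
}

resolve_threads 0 8    # prints 8
resolve_threads -2 8   # prints 6
resolve_threads 4 8    # prints 4
```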

CheckIlluminaDirectory

Asserts the validity for specified Illumina basecalling data.

This tool will check that the basecall directory and the internal files are available, exist, and are reasonably sized for every tile and cycle. Reasonably sized means non-zero sized for files that exist per tile and equal size for binary files that exist per cycle or per tile. If DATA_TYPES {Position, BaseCalls, QualityScores, PF, or Barcodes} are not specified, then the default data types used by IlluminaBasecallsToSam are used. CheckIlluminaDirectory DOES NOT check that the individual records in a file are well-formed.

Usage example:

java -jar picard.jar CheckIlluminaDirectory \
BASECALLS_DIR=/BaseCalls/ \
READ_STRUCTURE=25T8B25T \
LANES=1 \
DATA_TYPES=BaseCalls

Option | Description
BASECALLS_DIR (File) | The basecalls output directory. Required.
DATA_TYPES (IlluminaDataType) | The data types that should be checked for each tile/cycle. If no values are provided, the data types checked are those required by IlluminaBasecallsToSam (which is a superset of those used in ExtractIlluminaBarcodes). These data types vary slightly depending on whether or not the run is barcoded, so READ_STRUCTURE should be the same as that which will be passed to IlluminaBasecallsToSam. If this option is left unspecified, both ExtractIlluminaBarcodes and IlluminaBasecallsToSam should complete successfully UNLESS the individual records of the files themselves are spurious. Default value: null. Possible values: {Position, BaseCalls, QualityScores, PF, Barcodes} This option may be specified 0 or more times.
READ_STRUCTURE (String) | A description of the logical structure of clusters in an Illumina Run, i.e. a description of the structure IlluminaBasecallsToSam assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for Sample Barcode, M for molecular barcode, T for Template, and S for skip). E.g. if the input data consists of 80 base clusters and we provide a read structure of "28T8M8B8S28T", then the sequence may be split up into four reads: read one with 28 cycles (bases) of template; read two with 8 cycles (bases) of molecular barcode (e.g. unique molecular barcode); read three with 8 cycles (bases) of sample barcode; 8 cycles (bases) skipped; read four with 28 cycles (bases) of template. The skipped cycles would NOT be included in an output SAM/BAM file or in read groups therein. Note: if you want to check whether or not a future IlluminaBasecallsToSam or ExtractIlluminaBarcodes run will fail, be sure to use the exact same READ_STRUCTURE that you would pass to these programs for this run. Required.
LANES (Integer) | The number of the lane(s) to check. Default value: null. This option must be specified at least once.
TILE_NUMBERS (Integer) | The number(s) of the tile(s) to check. Default value: null. This option may be specified 0 or more times.
FAKE_FILES (Boolean) | A flag to determine whether or not to create fake versions of the missing files. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
LINK_LOCS (Boolean) | A flag to create symlinks to the loc file for the X Ten for each tile. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

CheckTerminatorBlock

Asserts that the last block of the provided gzip file (e.g., BAM) is well-formed; exits with return code 100 otherwise.

Option | Description
INPUT (File) | The block compressed file to check. Required.

IntervalListTools

Manipulates interval lists. This tool offers multiple interval list file manipulation capabilities, including sorting, merging, subtracting, padding, customizing, and other set-theoretic operations. If given one or more inputs, the default operation is to merge and sort them. Other operations, e.g. interval subtraction, are controlled by the arguments. The tool lists intervals with respect to a reference sequence.

Both interval_list and VCF files are accepted as input. The interval_list file format is relatively simple and reflects the SAM alignment format to a degree. A SAM style header must be present in the file that lists the sequence records against which the intervals are described. After the header, the file then contains records, one per line in text format with the following values tab-separated:

  • Sequence name (SN)
  • Start position (1-based)
  • End position (1-based, end inclusive)
  • Strand (either + or -)
  • Interval name (ideally unique names for intervals)

The coordinate system of interval_list files is such that the first base or position in a sequence is position "1".
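
A minimal interval_list file following the layout described above might be created as shown below. The sequence name and length, the interval names, and the coordinates are illustrative values, and the header lines are just one plausible SAM-style header; a real file would carry the sequence dictionary of the reference actually in use.

```shell
# Write a small, hypothetical interval_list: SAM-style header, then tab-separated records.
{
  printf '@HD\tVN:1.6\tSO:coordinate\n'
  printf '@SQ\tSN:chr1\tLN:248956422\n'
  printf 'chr1\t100\t200\t+\ttarget_1\n'
  printf 'chr1\t500\t650\t-\ttarget_2\n'
} > example.interval_list

# Report the 1-based, end-inclusive length of each record (end - start + 1).
lengths=$(awk -F '\t' '!/^@/ { print $5 ": " ($3 - $2 + 1) " bases" }' example.interval_list)
echo "$lengths"
```

Because both coordinates are 1-based and the end is inclusive, the interval 100-200 covers 101 bases rather than 100.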

Usage example:

java -jar picard.jar IntervalListTools \
I=input.interval_list \
SI=input_2.interval_list \
O=new.interval_list

OptionDescription
INPUT (File)One or more interval lists. If multiple interval lists are provided the output is the result of merging the inputs. Supported formats are interval_list and VCF. Default value: null. This option must be specified at least 1 time.
OUTPUT (File)The output interval list file to write (if SCATTER_COUNT is 1) or the directory into which to write the scattered interval sub-directories (if SCATTER_COUNT > 1). Default value: null.
PADDING (Integer)The amount to pad each end of the intervals by before other operations are undertaken. Negative numbers are allowed and indicate intervals should be shrunk. Resulting intervals < 0 bases long will be removed. Padding is applied to the interval lists before the ACTION is performed. Default value: 0. This option can be set to 'null' to clear the default value.
UNIQUE (Boolean)If true, merge overlapping and adjacent intervals to create a list of unique intervals. Implies SORT=true. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
SORT (Boolean)If true, sort the resulting interval list by coordinate. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ACTION (Action)Action to take on inputs. Default value: CONCAT. This option can be set to 'null' to clear the default value. Possible values: { CONCAT (The concatenation of all the INPUTs, no sorting or merging of overlapping/abutting intervals implied. Will result in an unsorted list unless requested otherwise.) UNION (Like CONCAT but with UNIQUE and SORT implied, the result being the set-wise union of all INPUTs.) INTERSECT (The sorted, uniqued set of all loci that are contained in all of the INPUTs.) SUBTRACT (Subtracts SECOND_INPUT from INPUT. The resulting loci are those in INPUT that are not in SECOND_INPUT.) SYMDIFF (Find loci that are in INPUT or SECOND_INPUT but are not in both.) OVERLAPS (Find only intervals in INPUT that have any overlap with SECOND_INPUT.) }
SECOND_INPUT (File)Second set of intervals for the actions that compare INPUT against SECOND_INPUT (SUBTRACT, SYMDIFF, and OVERLAPS, as described under ACTION). Default value: null. This option may be specified 0 or more times.
COMMENT (String)One or more lines of comment to add to the header of the output file. Default value: null. This option may be specified 0 or more times.
SCATTER_COUNT (Integer)The number of files into which to scatter the resulting list by locus; in some situations, fewer intervals may be emitted. Note - if > 1, the resultant scattered intervals will be sorted and uniqued. The sort will be inverted if the INVERT flag is set. Default value: 1. This option can be set to 'null' to clear the default value.
INCLUDE_FILTERED (Boolean)Whether to include filtered variants in the VCF when generating an interval list from a VCF. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
BREAK_BANDS_AT_MULTIPLES_OF (Integer)If set to a positive value, will create a new interval list with the original intervals broken up at integer multiples of this value. Set to 0 to NOT break up intervals. Default value: 0. This option can be set to 'null' to clear the default value.
SUBDIVISION_MODE (Mode)Mode used when scattering the interval list; it determines whether individual intervals may be subdivided (see the possible values). Default value: INTERVAL_SUBDIVISION. This option can be set to 'null' to clear the default value. Possible values: {INTERVAL_SUBDIVISION, BALANCING_WITHOUT_INTERVAL_SUBDIVISION, BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW}
INVERT (Boolean)Produce the inverse list. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
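
The PADDING and UNIQUE options described in the table above can be pictured with a small Python sketch. This is not Picard's implementation: `pad` and `merge` are hypothetical helpers that work on `(start, end)` pairs for a single contig, using 1-based, end-inclusive coordinates. Two simplifications are assumptions of the sketch: padded starts are clamped at position 1 (the contig length is ignored), and an interval is dropped once its end falls before its start.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) on one contig, 1-based and end-inclusive


def pad(spans: List[Span], padding: int) -> List[Span]:
    """Grow each span at both ends (or shrink it when padding is negative)."""
    padded = [(max(1, start - padding), end + padding) for start, end in spans]
    # Assumption: spans that shrink past zero length (end before start) are dropped.
    return [(start, end) for start, end in padded if end >= start]


def merge(spans: List[Span]) -> List[Span]:
    """Sort spans and merge those that overlap or directly abut each other."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or adjacent: extend the previous span.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


unique_spans = merge([(30, 40), (1, 10), (11, 20), (35, 50)])
grown = pad([(100, 200)], 10)
shrunk_away = pad([(100, 105)], -5)
```

In this sketch, `(1, 10)` and `(11, 20)` are merged because they are adjacent, which matches the description of UNIQUE as merging "overlapping and adjacent intervals".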

LiftOverIntervalList

Lifts over an interval list from one reference build to another. This tool adjusts the coordinates in an interval list derived from one reference to match a new reference, based on a chain file that describes the correspondence between the two references. It is based on the UCSC liftOver tool (see: http://genome.ucsc.edu/cgi-bin/hgLiftOver) and uses a UCSC chain file to guide its operation. It accepts either Picard interval_list files or VCF files as interval inputs.

Usage example:

java -jar picard.jar LiftOverIntervalList \
I=input.interval_list \
O=output.interval_list \
SD=reference_sequence.dict \
CHAIN=build.chain

OptionDescription
INPUT (File)Interval list to be lifted over. Required.
OUTPUT (File)Where to write lifted-over interval list. Required.
SEQUENCE_DICTIONARY (File)Sequence dictionary to write into the output interval list. Required.
CHAIN (File)Chain file that guides LiftOver. Required.
MIN_LIFTOVER_PCT (Double)Minimum percentage of bases in each input interval that must map to the output interval, expressed as a fraction (0.95 means 95%). Default value: 0.95. This option can be set to 'null' to clear the default value.

LiftoverVcf

Lifts over a VCF file from one reference build to another. This tool adjusts the coordinates of variants within a VCF file to match a new reference. The output file will be sorted and indexed using the target reference build. To be clear, REFERENCE_SEQUENCE should be the target reference build. The tool is based on the UCSC liftOver tool (see: http://genome.ucsc.edu/cgi-bin/hgLiftOver) and uses a UCSC chain file to guide its operation.

Note that records may be rejected because they cannot be lifted over or because of sequence incompatibilities between the source and target reference genomes. Rejected records will be emitted with filters to the REJECT file, using the source genome coordinates.

Usage example:

java -jar picard.jar LiftoverVcf \
I=input.vcf \
O=lifted_over.vcf \
CHAIN=b37tohg19.chain \
REJECT=rejected_variants.vcf \
R=reference_sequence.fasta

For additional information, please see: http://genome.ucsc.edu/cgi-bin/hgLiftOver

OptionDescription
INPUT (File)The input VCF/BCF file to be lifted over. Required.
OUTPUT (File)The output location to write the lifted over VCF/BCF to. Required.
CHAIN (File)The liftover chain file. See https://genome.ucsc.edu/goldenPath/help/chain.html for a description of chain files. See http://hgdownload.soe.ucsc.edu/downloads.html#terms for where to download chain files. Required.
REJECT (File)File to which to write rejected records. Required.
REFERENCE_SEQUENCE (File)The reference sequence (fasta) for the TARGET genome build. The fasta file must have an accompanying sequence dictionary (.dict file). Required.
WARN_ON_MISSING_CONTIG (Boolean)Warn on missing contig. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
WRITE_ORIGINAL_POSITION (Boolean)Write the original contig/position for lifted variants to the INFO field. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
LIFTOVER_MIN_MATCH (Double)The minimum percent match required for a variant to be lifted, expressed as a fraction (1.0 means 100%). Default value: 1.0. This option can be set to 'null' to clear the default value.
ALLOW_MISSING_FIELDS_IN_HEADER (Boolean)Allow INFO and FORMAT fields in the records that are not found in the header. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

MakeSitesOnlyVcf

Reads a VCF/VCF.gz/BCF and removes all genotype information from it while retaining all site level information, including annotations based on genotypes (e.g. AN, AF). Output can be any supported variant format including .vcf, .vcf.gz or .bcf.

OptionDescription
INPUT (File)Input VCF or BCF. Required.
OUTPUT (File)Output VCF or BCF to emit without per-sample info. Required.
SAMPLE (String)Optionally one or more samples to retain when building the 'sites-only' VCF. Default value: null. This option may be specified 0 or more times.

MarkDuplicates

Identifies duplicate reads.

This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation e.g. library construction using PCR. See also EstimateLibraryComplexity for additional notes on PCR duplication artifacts. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.

The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file. A BARCODE_TAG option is available to facilitate duplicate marking using molecular barcodes. After duplicate reads are collected, the tool differentiates the primary and duplicate reads using an algorithm that ranks reads by the sums of their base-quality scores (default method).

The tool's main output is a new SAM or BAM file, in which duplicates have been identified in the SAM flags field for each read. Duplicates are marked with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024. If you are not familiar with this type of annotation, please see the following blog post for additional information.
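
For readers new to bitwise SAM flags, the short Python sketch below shows how the duplicate bit can be tested. It is only an illustration: the `is_duplicate` helper and the example flag values are made up here, while the 0x0400 (decimal 1024) value comes from the paragraph above.

```python
DUPLICATE_FLAG = 0x400  # hexadecimal 0x0400, i.e. decimal 1024


def is_duplicate(flag: int) -> bool:
    """Return True when the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


# Example FLAG values (made up for illustration):
# 99   = 64 + 32 + 2 + 1          -> duplicate bit not set
# 1123 = 1024 + 64 + 32 + 2 + 1   -> the same read, marked as a duplicate
unmarked = is_duplicate(99)
marked = is_duplicate(1123)
```

Because each property of a read occupies its own bit, marking a read as a duplicate adds 1024 to its FLAG value without disturbing the other bits.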

Although the bitwise flag annotation indicates whether a read was marked as a duplicate, it does not identify the type of duplicate. To do this, a new tag called the duplicate type (DT) tag was recently added as an optional output in the 'optional field' section of a SAM/BAM file. Invoking the TAGGING_POLICY option, you can instruct the program to mark all the duplicates (All), only the optical duplicates (OpticalOnly), or no duplicates (DontTag). The records within the output of a SAM/BAM file will have values for the 'DT' tag (depending on the invoked TAGGING_POLICY), as either library/PCR-generated duplicates (LB), or sequencing-platform artifact duplicates (SQ). This tool uses the READ_NAME_REGEX and the OPTICAL_DUPLICATE_PIXEL_DISTANCE options as the primary methods to identify and differentiate duplicate types. Set READ_NAME_REGEX to null to skip optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate counts, library size estimation will be inaccurate.

MarkDuplicates also produces a metrics file indicating the numbers of duplicates for both single- and paired-end reads.

The program can take either coordinate-sorted or query-sorted inputs; however, the behavior is slightly different. When the input is coordinate-sorted, unmapped mates of mapped records and supplementary/secondary alignments are not marked as duplicates. However, when the input is query-sorted (actually query-grouped), unmapped mates and secondary/supplementary reads are not excluded from the duplication test and can be marked as duplicate reads.

If desired, duplicates can be removed using the REMOVE_DUPLICATES and REMOVE_SEQUENCING_DUPLICATES options.

Usage example:

java -jar picard.jar MarkDuplicates \
I=input.bam \
O=marked_duplicates.bam \
M=marked_dup_metrics.txt

Please see MarkDuplicates for detailed explanations of the output metrics.

OptionDescription
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP (Integer)This option is obsolete. ReadEnds will always be spilled to disk. Default value: 50000. This option can be set to 'null' to clear the default value.
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP (Integer)Maximum number of file handles to keep open when spilling read ends to disk. Set this number a little lower than the per-process maximum number of files that may be open. This number can be found by executing the 'ulimit -n' command on a Unix system. Default value: 8000. This option can be set to 'null' to clear the default value.
SORTING_COLLECTION_SIZE_RATIO (Double)This number, together with the maximum RAM available to the JVM, determines the memory footprint used by some of the sorting collections. If you are running out of memory, try reducing this number. Default value: 0.25. This option can be set to 'null' to clear the default value.
BARCODE_TAG (String)Barcode SAM tag (ex. BC for 10X Genomics). Default value: null.
READ_ONE_BARCODE_TAG (String)Read one barcode SAM tag (ex. BX for 10X Genomics). Default value: null.
READ_TWO_BARCODE_TAG (String)Read two barcode SAM tag (ex. BX for 10X Genomics). Default value: null.
TAG_DUPLICATE_SET_MEMBERS (Boolean)If a read appears in a duplicate set, add two tags. The first tag, DUPLICATE_SET_SIZE_TAG (DS), indicates the size of the duplicate set. The smallest possible DS value is 2 which occurs when two reads map to the same portion of the reference only one of which is marked as duplicate. The second tag, DUPLICATE_SET_INDEX_TAG (DI), represents a unique identifier for the duplicate set to which the record belongs. This identifier is the index-in-file of the representative read that was selected out of the duplicate set. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
REMOVE_SEQUENCING_DUPLICATES (Boolean)If true remove 'optical' duplicates and other duplicates that appear to have arisen from the sequencing process instead of the library preparation process, even if REMOVE_DUPLICATES is false. If REMOVE_DUPLICATES is true, all duplicates are removed and this option is ignored. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
TAGGING_POLICY (DuplicateTaggingPolicy)Determines how duplicate types are recorded in the DT optional attribute. Default value: DontTag. This option can be set to 'null' to clear the default value. Possible values: {DontTag, OpticalOnly, All}
INPUT (String)One or more input SAM or BAM files to analyze. Must be coordinate sorted. Default value: null. This option may be specified 0 or more times.
OUTPUT (File)The output file to write marked records to. Required.
METRICS_FILE (File)File to write duplication metrics to. Required.
REMOVE_DUPLICATES (Boolean)If true, do not write duplicates to the output file, instead of writing them with appropriate flags set. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ASSUME_SORTED (Boolean)If true, assume that the input file is coordinate sorted even if the header says otherwise. Deprecated, use ASSUME_SORT_ORDER=coordinate instead. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false} Cannot be used in conjunction with option(s) ASSUME_SORT_ORDER (ASO)
ASSUME_SORT_ORDER (SortOrder)If not null, assume that the input file has this order even if the header says otherwise. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate, unknown} Cannot be used in conjunction with option(s) ASSUME_SORTED (AS)
DUPLICATE_SCORING_STRATEGY (ScoringStrategy)The scoring strategy for choosing the non-duplicate among candidates. Default value: SUM_OF_BASE_QUALITIES. This option can be set to 'null' to clear the default value. Possible values: {SUM_OF_BASE_QUALITIES, TOTAL_MAPPED_REFERENCE_LENGTH, RANDOM}
PROGRAM_RECORD_ID (String)The program record ID for the @PG record(s) created by this program. Set to null to disable PG record creation. This string may have a suffix appended to avoid collision with other program record IDs. Default value: MarkDuplicates. This option can be set to 'null' to clear the default value.
PROGRAM_GROUP_VERSION (String)Value of VN tag of PG record to be created. If not specified, the version will be detected automatically. Default value: null.
PROGRAM_GROUP_COMMAND_LINE (String)Value of CL tag of PG record to be created. If not supplied the command line will be detected automatically. Default value: null.
PROGRAM_GROUP_NAME (String)Value of PN tag of PG record to be created. Default value: MarkDuplicates. This option can be set to 'null' to clear the default value.
COMMENT (String)Comment(s) to include in the output file's header. Default value: null. This option may be specified 0 or more times.
READ_NAME_REGEX (String)Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate counts, library size estimation will be inaccurate. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values. Default value: . This option can be set to 'null' to clear the default value.
OPTICAL_DUPLICATE_PIXEL_DISTANCE (Integer)The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best. Default value: 100. This option can be set to 'null' to clear the default value.
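
To make the interplay of READ_NAME_REGEX and OPTICAL_DUPLICATE_PIXEL_DISTANCE more concrete, here is a simplified Python sketch. It is not Picard's code. `parse_location` follows the colon-splitting rule described for the default READ_NAME_REGEX (5 or 7 colon-separated elements, with the last three taken as tile, x and y). `is_optical_pair` assumes that two duplicate reads count as optical duplicates when they share a tile and differ by at most the pixel distance on each axis. Both function names, the per-axis comparison, and the read names are assumptions made for illustration; other criteria the tool may apply are ignored.

```python
from typing import Tuple


def parse_location(read_name: str) -> Tuple[int, int, int]:
    """Extract (tile, x, y) from a colon-separated read name with 5 or 7 elements."""
    fields = read_name.split(":")
    if len(fields) not in (5, 7):
        raise ValueError(f"expected 5 or 7 colon-separated elements: {read_name!r}")
    # In both layouts the last three elements are tile, x and y.
    tile, x, y = (int(value) for value in fields[-3:])
    return tile, x, y


def is_optical_pair(name_a: str, name_b: str, pixel_distance: int = 100) -> bool:
    """Assumed rule: same tile, and within pixel_distance on both the x and y axes."""
    tile_a, x_a, y_a = parse_location(name_a)
    tile_b, x_b, y_b = parse_location(name_b)
    return (
        tile_a == tile_b
        and abs(x_a - x_b) <= pixel_distance
        and abs(y_a - y_b) <= pixel_distance
    )


# Made-up read names for illustration.
close = is_optical_pair("M01:7:1101:1000:2000", "M01:7:1101:1050:2040")
far = is_optical_pair("M01:7:1101:1000:2000", "M01:7:1101:1500:2000")
far_patterned = is_optical_pair(
    "M01:7:1101:1000:2000", "M01:7:1101:1500:2000", pixel_distance=2500
)
```

The last call shows why the larger value suggested for patterned flowcells changes the outcome: clusters 500 pixels apart are not within the default distance of 100 but are within 2500.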

MarkDuplicatesWithMateCigar

Identifies duplicate reads, accounting for mate CIGAR. This tool locates and tags duplicate reads (both PCR and optical) in a BAM or SAM file, where duplicate reads are defined as originating from the same original fragment of DNA, taking into account the CIGAR string of read mates.

It is intended as an improvement upon the original MarkDuplicates algorithm, from which it differs in several ways, including differences in how it breaks ties. It may be the most effective duplicate marking program available, as it handles all cases including clipped and gapped alignments and locates duplicate molecules using mate CIGAR information. However, please note that it is not yet used in the Broad's production pipeline, so use it at your own risk.

Note also that this tool will not work with alignments that have large gaps or deletions, such as those from RNA-seq data. This is due to the need to buffer small genomic windows to ensure integrity of the duplicate marking, while large skips (ex. skipping introns) in the alignment records would force making that window very large, thus exhausting memory.

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage example:

java -jar picard.jar MarkDuplicatesWithMateCigar \
I=input.bam \
O=mark_dups_w_mate_cig.bam \
M=mark_dups_w_mate_cig_metrics.txt

OptionDescription
MINIMUM_DISTANCE (Integer)The minimum distance to buffer records to account for clipping on the 5' end of the records. For a given alignment, this parameter controls the width of the window to search for duplicates of that alignment. Due to 5' read clipping, duplicates do not necessarily have the same 5' alignment coordinates, so the algorithm needs to search around the neighborhood. For single end sequencing data, the neighborhood is only determined by the amount of clipping (assuming no split reads), thus setting MINIMUM_DISTANCE to twice the sequencing read length should be sufficient. For paired end sequencing, the neighborhood is also determined by the fragment insert size, so you may want to set MINIMUM_DISTANCE to something like twice the 99.5th percentile of the fragment insert size distribution (see CollectInsertSizeMetrics). Alternatively, you can set this number to -1 to use either a) twice the first read's read length, or b) 100, whichever is smaller. Note that the larger the window, the greater the RAM requirements, so you could run into performance limitations if you use a value that is unnecessarily large. Default value: -1. This option can be set to 'null' to clear the default value.
SKIP_PAIRS_WITH_NO_MATE_CIGAR (Boolean)Skip record pairs with no mate cigar and include them in the output. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
BLOCK_SIZE (Integer)The block size for use in the coordinate-sorted record buffer. Default value: 100000. This option can be set to 'null' to clear the default value.
INPUT (String)One or more input SAM or BAM files to analyze. Must be coordinate sorted. Default value: null. This option may be specified 0 or more times.
OUTPUT (File)The output file to write marked records to. Required.
METRICS_FILE (File)File to write duplication metrics to. Required.
REMOVE_DUPLICATES (Boolean)If true, do not write duplicates to the output file, instead of writing them with appropriate flags set. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ASSUME_SORTED (Boolean)If true, assume that the input file is coordinate sorted even if the header says otherwise. Deprecated, use ASSUME_SORT_ORDER=coordinate instead. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false} Cannot be used in conjunction with option(s) ASSUME_SORT_ORDER (ASO)
ASSUME_SORT_ORDER (SortOrder)If not null, assume that the input file has this order even if the header says otherwise. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate, unknown} Cannot be used in conjunction with option(s) ASSUME_SORTED (AS)
DUPLICATE_SCORING_STRATEGY (ScoringStrategy)The scoring strategy for choosing the non-duplicate among candidates. Default value: TOTAL_MAPPED_REFERENCE_LENGTH. This option can be set to 'null' to clear the default value. Possible values: {SUM_OF_BASE_QUALITIES, TOTAL_MAPPED_REFERENCE_LENGTH, RANDOM}
PROGRAM_RECORD_ID (String)The program record ID for the @PG record(s) created by this program. Set to null to disable PG record creation. This string may have a suffix appended to avoid collision with other program record IDs. Default value: MarkDuplicates. This option can be set to 'null' to clear the default value.
PROGRAM_GROUP_VERSION (String)Value of VN tag of PG record to be created. If not specified, the version will be detected automatically. Default value: null.
PROGRAM_GROUP_COMMAND_LINE (String)Value of CL tag of PG record to be created. If not supplied the command line will be detected automatically. Default value: null.
PROGRAM_GROUP_NAME (String)Value of PN tag of PG record to be created. Default value: MarkDuplicatesWithMateCigar. This option can be set to 'null' to clear the default value.
COMMENT (String)Comment(s) to include in the output file's header. Default value: null. This option may be specified 0 or more times.
READ_NAME_REGEX (String)Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate counts, library size estimation will be inaccurate. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values. Default value: . This option can be set to 'null' to clear the default value.
OPTICAL_DUPLICATE_PIXEL_DISTANCE (Integer)The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best. Default value: 100. This option can be set to 'null' to clear the default value.
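
The MINIMUM_DISTANCE rule in the table above (a value of -1 selects twice the first read's length or 100, whichever is smaller) can be restated as a few lines of Python. The function name `effective_minimum_distance` is hypothetical, and the sketch only restates the documented rule; it does not reproduce how the tool reads the first record.

```python
def effective_minimum_distance(minimum_distance: int, first_read_length: int) -> int:
    """Resolve MINIMUM_DISTANCE, treating -1 as 'derive it from the first read'."""
    if minimum_distance == -1:
        # Documented rule: twice the first read's length or 100, whichever is smaller.
        return min(2 * first_read_length, 100)
    return minimum_distance


short_reads = effective_minimum_distance(-1, 36)   # 2 * 36 = 72, below the cap of 100
long_reads = effective_minimum_distance(-1, 150)   # 2 * 150 = 300, capped at 100
explicit = effective_minimum_distance(500, 150)    # an explicit value is used as given
```

For paired-end data with large inserts, the derived value may be smaller than the window suggested in the option description, so an explicit setting may be preferable.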

MeanQualityByCycle

Collect mean quality by cycle. This tool generates a data table and chart of mean quality by cycle from a BAM file. It is intended to be used on a single lane or a read group's worth of data, but can be applied to merged BAMs if needed.

This metric gives an overall snapshot of sequencing machine performance. For most types of sequencing data, the output is expected to show a slight reduction in overall base quality scores towards the end of each read. Spikes in quality within reads are not expected and may indicate that technical problems occurred during sequencing.

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage example:

java -jar picard.jar MeanQualityByCycle \
I=input.bam \
O=mean_qual_by_cycle.txt \
CHART=mean_qual_by_cycle.pdf

OptionDescription
CHART_OUTPUT (File)A file (with .pdf extension) to write the chart to. Required.
ALIGNED_READS_ONLY (Boolean)If set to true, calculate mean quality over aligned reads only. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
PF_READS_ONLY (Boolean)If set to true calculate mean quality over PF reads only. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INPUT (File)Input SAM or BAM file. Required.
OUTPUT (File)File to write the output to. Required.
ASSUME_SORTED (Boolean)If true (default), then the sort order in the header file will be ignored. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
STOP_AFTER (Long)Stop after processing N reads, mainly for debugging. Default value: 0. This option can be set to 'null' to clear the default value.

MergeBamAlignment

Merge alignment data from a SAM or BAM with data in an unmapped BAM file. This tool produces a new SAM or BAM file that includes all aligned and unaligned reads and also carries forward additional read attributes from the unmapped BAM (attributes that are otherwise lost in the process of alignment). The purpose of this tool is to use information from the unmapped BAM to fix up aligner output. The resulting file will be valid for use by other Picard tools. For simple BAM file merges, use MergeSamFiles. Note that MergeBamAlignment expects to find a sequence dictionary in the same directory as REFERENCE_SEQUENCE and expects it to have the same base name as the reference FASTA except with the extension ".dict". If the output sort order is not coordinate, then reads that are clipped due to adapters or overlapping will not contain the NM, MD, or UQ tags.

Usage example:

java -jar picard.jar MergeBamAlignment \
ALIGNED=aligned.bam \
UNMAPPED=unmapped.bam \
O=merge_alignments.bam \
R=reference_sequence.fasta

OptionDescription
UNMAPPED_BAM (File)Original SAM or BAM file of unmapped reads, which must be in queryname order. Required.
ALIGNED_BAM (File)SAM or BAM file(s) with alignment data. Default value: null. This option may be specified 0 or more times. Cannot be used in conjunction with option(s) READ1_ALIGNED_BAM (R1_ALIGNED) READ2_ALIGNED_BAM (R2_ALIGNED)
READ1_ALIGNED_BAM (File)SAM or BAM file(s) with alignment data from the first read of a pair. Default value: null. This option may be specified 0 or more times. Cannot be used in conjunction with option(s) ALIGNED_BAM (ALIGNED)
READ2_ALIGNED_BAM (File)SAM or BAM file(s) with alignment data from the second read of a pair. Default value: null. This option may be specified 0 or more times. Cannot be used in conjunction with option(s) ALIGNED_BAM (ALIGNED)
OUTPUT (File)Merged SAM or BAM file to write to. Required.
REFERENCE_SEQUENCE (File)Path to the fasta file for the reference sequence. Required.
PROGRAM_RECORD_ID (String)The program group ID of the aligner (if not supplied by the aligned file). Default value: null.
PROGRAM_GROUP_VERSION (String)The version of the program group (if not supplied by the aligned file). Default value: null.
PROGRAM_GROUP_COMMAND_LINE (String)The command line of the program group (if not supplied by the aligned file). Default value: null.
PROGRAM_GROUP_NAME (String)The name of the program group (if not supplied by the aligned file). Default value: null.
PAIRED_RUN (Boolean)This argument is ignored and will be removed. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
JUMP_SIZE (Integer)The expected jump size (required if this is a jumping library). Deprecated. Use EXPECTED_ORIENTATIONS instead. Default value: null. Cannot be used in conjunction with option(s) EXPECTED_ORIENTATIONS (ORIENTATIONS)
CLIP_ADAPTERS (Boolean)Whether to clip adapters where identified. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
IS_BISULFITE_SEQUENCE (Boolean)Whether the lane is bisulfite sequence (used when calculating the NM tag). Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ALIGNED_READS_ONLY (Boolean)Whether to output only aligned reads. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MAX_INSERTIONS_OR_DELETIONS (Integer)The maximum number of insertions or deletions permitted for an alignment to be included. Alignments with more than this many insertions or deletions will be ignored. Set to -1 to allow any number of insertions or deletions. Default value: 1. This option can be set to 'null' to clear the default value.
ATTRIBUTES_TO_RETAIN (String)Reserved alignment attributes (tags starting with X, Y, or Z) that should be brought over from the alignment data when merging. Default value: null. This option may be specified 0 or more times.
ATTRIBUTES_TO_REMOVE (String)Attributes from the alignment record that should be removed when merging. This overrides ATTRIBUTES_TO_RETAIN if they share common tags. Default value: null. This option may be specified 0 or more times.
ATTRIBUTES_TO_REVERSE (String)Attributes on negative strand reads that need to be reversed. Default value: [OQ, U2]. This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
ATTRIBUTES_TO_REVERSE_COMPLEMENT (String)Attributes on negative strand reads that need to be reverse complemented. Default value: [E2, SQ]. This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
READ1_TRIM (Integer)The number of bases trimmed from the beginning of read 1 prior to alignment. Default value: 0. This option can be set to 'null' to clear the default value.
READ2_TRIM (Integer)The number of bases trimmed from the beginning of read 2 prior to alignment. Default value: 0. This option can be set to 'null' to clear the default value.
EXPECTED_ORIENTATIONS (PairOrientation)The expected orientation of proper read pairs. Replaces JUMP_SIZE. Default value: null. Possible values: {FR, RF, TANDEM} This option may be specified 0 or more times. Cannot be used in conjunction with option(s) JUMP_SIZE (JUMP)
ALIGNER_PROPER_PAIR_FLAGS (Boolean)Use the aligner's idea of what a proper pair is rather than computing in this program. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
SORT_ORDER (SortOrder)The order in which the merged reads should be output. Default value: coordinate. This option can be set to 'null' to clear the default value. Possible values: {unsorted, queryname, coordinate, duplicate, unknown}
PRIMARY_ALIGNMENT_STRATEGY (PrimaryAlignmentStrategy)Strategy for selecting primary alignment when the aligner has provided more than one alignment for a pair or fragment, and none are marked as primary, more than one is marked as primary, or the primary alignment is filtered out for some reason. BestMapq expects that multiple alignments will be correlated with HI tag, and prefers the pair of alignments with the largest MAPQ, in the absence of a primary selected by the aligner. EarliestFragment prefers the alignment which maps the earliest base in the read. Note that EarliestFragment may not be used for paired reads. BestEndMapq is appropriate for cases in which the aligner is not pair-aware, and does not output the HI tag. It simply picks the alignment for each end with the highest MAPQ, and makes those alignments primary, regardless of whether the two alignments make sense together. MostDistant is also for a non-pair-aware aligner, and picks the alignment pair with the largest insert size. If all alignments would be chimeric, it picks the alignments for each end with the best MAPQ. For all algorithms, ties are resolved arbitrarily. Default value: BestMapq. This option can be set to 'null' to clear the default value. Possible values: {BestMapq, EarliestFragment, BestEndMapq, MostDistant}
CLIP_OVERLAPPING_READS (Boolean)For paired reads, soft clip the 3' end of each read if necessary so that it does not extend past the 5' end of its mate. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INCLUDE_SECONDARY_ALIGNMENTS (Boolean)If false, do not write secondary alignments to output. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ADD_MATE_CIGAR (Boolean)Adds the mate CIGAR tag (MC) if true, does not if false. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
UNMAP_CONTAMINANT_READS (Boolean)Detect reads originating from foreign organisms (e.g. bacterial DNA in a non-bacterial sample), and unmap and label those reads accordingly. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MIN_UNCLIPPED_BASES (Integer)If UNMAP_CONTAMINANT_READS is set, require this many unclipped bases or else the read will be marked as contaminant. Default value: 32. This option can be set to 'null' to clear the default value.
MATCHING_DICTIONARY_TAGS (String)List of Sequence Records tags that must be equal (if present) in the reference dictionary and in the aligned file. Mismatching tags will cause an error if in this list, and a warning otherwise. Default value: [M5, LN]. This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
UNMAPPED_READ_STRATEGY (UnmappingReadStrategy)How to deal with alignment information in reads that are being unmapped (e.g. due to cross-species contamination). Currently ignored unless UNMAP_CONTAMINANT_READS=true. Default value: DO_NOT_CHANGE. This option can be set to 'null' to clear the default value. Possible values: {COPY_TO_TAG, DO_NOT_CHANGE, MOVE_TO_TAG}

MergeSamFiles

Merges multiple SAM and/or BAM files into a single file. This tool is used for combining SAM and/or BAM files from different runs or read groups, similarly to the "merge" function of Samtools (http://www.htslib.org/doc/samtools.html).

Note that to prevent errors in downstream processing, it is critical to identify/label read groups appropriately. If different samples contain identical read group IDs, this tool will avoid collisions by modifying the read group IDs to be unique. For more information about read groups, see the GATK Dictionary entry.


Usage example:

java -jar picard.jar MergeSamFiles \
I=input_1.bam \
I=input_2.bam \
O=merged_files.bam

OptionDescription
INPUT (File)SAM or BAM input file. Default value: null. This option must be specified at least once.
OUTPUT (File)SAM or BAM file to write the merged result to. Required.
SORT_ORDER (SortOrder)Sort order of output file. Default value: coordinate. This option can be set to 'null' to clear the default value. Possible values: {unsorted, queryname, coordinate, duplicate, unknown}
ASSUME_SORTED (Boolean)If true, assume that the input files are in the same sort order as the requested output sort order, even if their headers say otherwise. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MERGE_SEQUENCE_DICTIONARIES (Boolean)Merge the sequence dictionaries. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
USE_THREADING (Boolean)Option to create a background thread to encode, compress and write to disk the output file. The threaded version uses about 20% more CPU and decreases runtime by ~20% when writing out a compressed BAM file. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
COMMENT (String)Comment(s) to include in the merged output file's header. Default value: null. This option may be specified 0 or more times.
INTERVALS (File)An interval list file that contains the locations of the positions to merge. Assumes that the input BAM files are sorted and indexed. The resulting file will contain alignments that may overlap with genomic regions outside the requested region. Unmapped reads are discarded. Default value: null.
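
To illustrate why inputs that share the requested sort order can be combined in a single pass, here is a minimal Python sketch of a k-way merge of coordinate-sorted records. It is not Picard's implementation; the record layout (reference index, position, read name) is a hypothetical stand-in for real SAM records.

```python
import heapq

def merge_sorted(*inputs):
    """Merge already coordinate-sorted record lists into one sorted list.

    Each record is a hypothetical tuple: (reference_index, position, read_name).
    """
    # heapq.merge only ever compares the current head of each input,
    # so no input has to be re-sorted or fully loaded first.
    return list(heapq.merge(*inputs, key=lambda rec: (rec[0], rec[1])))

file_1 = [(0, 100, "a"), (0, 500, "b"), (1, 20, "c")]
file_2 = [(0, 250, "d"), (1, 10, "e")]
merged = merge_sorted(file_1, file_2)
```

If an input is not actually in the expected order, a merge like this silently produces unsorted output, which is one reason to use ASSUME_SORTED=true with care.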

MergeVcfs

Merges multiple VCF or BCF files into one VCF file. Input files must be sorted by their contigs and, within contigs, by start position. The input files must have the same sample and contig lists. An index file is created and a sequence dictionary is required by default.

OptionDescription
INPUT (File)VCF or BCF input files (file format is determined by file extension), or a file with a '.list' suffix containing the paths to the files. Default value: null. This option must be specified at least once.
OUTPUT (File)The merged VCF or BCF file. File format is determined by file extension. Required.
SEQUENCE_DICTIONARY (File)The index sequence dictionary to use instead of the sequence dictionary in the input file. Default value: null.

NormalizeFasta

Normalizes lines of sequence in a FASTA file to be of the same length. This tool takes any FASTA-formatted file and reformats the sequence to ensure that all of the sequence record lines are of the same length (with the exception of the last line). Although the default setting is 100 bases per line, a custom LINE_LENGTH can be specified by the user. In addition, record names can be truncated at the first instance of a whitespace character to ensure downstream compatibility.

Usage example:

java -jar picard.jar NormalizeFasta \
I=input_sequence.fasta \
O=normalized_sequence.fasta

OptionDescription
INPUT (File)The input FASTA file to normalize. Required.
OUTPUT (File)The output FASTA file to write. Required.
LINE_LENGTH (Integer)The line length to be used for the output FASTA file. Default value: 100. This option can be set to 'null' to clear the default value.
TRUNCATE_SEQUENCE_NAMES_AT_WHITESPACE (Boolean)Truncate sequence names at first whitespace. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
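
The reformatting described above can be sketched in a few lines of Python. This is an illustration of the idea only, not Picard's code; the function name and its parameters are hypothetical and merely mirror the LINE_LENGTH and TRUNCATE_SEQUENCE_NAMES_AT_WHITESPACE options.

```python
def normalize_fasta(text, line_length=100, truncate_names=False):
    """Re-wrap every sequence in a FASTA string to a fixed line length."""
    records = []  # each entry is [name, list_of_sequence_lines]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            if truncate_names and name.split():
                name = name.split()[0]  # keep text before the first whitespace
            records.append([name, []])
        elif records:
            records[-1][1].append(line)

    out = []
    for name, parts in records:
        seq = "".join(parts)
        out.append(">" + name)
        # Every line has line_length bases except possibly the last one.
        out.extend(seq[i:i + line_length] for i in range(0, len(seq), line_length))
    return "\n".join(out) + "\n"

example = normalize_fasta(">chr1 test\nACGTAC\nGT\nACG\n", line_length=4, truncate_names=True)
```

Here `example` is `">chr1\nACGT\nACGT\nACG\n"`: the 11 bases are re-wrapped to 4 per line and the name is cut at the first space.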

PositionBasedDownsampleSam

Downsamples a SAM or BAM file while ensuring that either both ends of a pair are kept or both are discarded. The program extracts from each read name the position within the tile from which the read came, and bases the downsampling on this position. Running the tool again on exactly the same input produces the same output.

Note 1: This is technology and read-name dependent. If your read names do not have coordinate information, or if your BAM contains reads from multiple technologies (flowcell versions, sequencing machines), this will not work properly. This has been designed with Illumina MiSeq/HiSeq in mind.
Note 2: The downsampling is not random. It is deterministically dependent on the position of the read within its tile.
Note 3: Downsampling twice with this program is not supported.
Note 4: You should call MarkDuplicates after downsampling.

Finally, the code has been designed to simulate sequencing less as accurately as possible, not to achieve an exact downsample fraction. In particular, since the reads may be distributed unevenly within the lanes/tiles, the resulting downsampling percentage will not be accurately determined by the input argument FRACTION.

OptionDescription
INPUT (File)The input SAM or BAM file to downsample. Required.
OUTPUT (File)The output, downsampled, SAM or BAM file to write. Required.
FRACTION (Double)The (approximate) fraction of reads to be kept, between 0 and 1. Required.
STOP_AFTER (Long)Stop after processing N reads, mainly for debugging. Default value: null.
ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS (Boolean)Allow Downsampling again despite this being a bad idea with possibly unexpected results. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
REMOVE_DUPLICATE_INFORMATION (Boolean)Determines whether the duplicate tag should be reset since the downsampling requires re-marking duplicates. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
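
The following Python sketch shows one way a keep/discard decision can be made deterministic and pair-consistent by deriving it from the read name. It is an assumption-laden illustration, not the tool's actual algorithm: it assumes Illumina-style names whose last three colon-separated fields are tile, x and y, and it uses a CRC32 hash of those fields where the real tool works from the position within the tile.

```python
import zlib

def keep_read(read_name, fraction):
    """Deterministically decide whether to keep a read, based on its name.

    Assumes the last three colon-separated fields are tile, x and y
    (hypothetical example: "M01:7:FC1:1:1101:15589:1332").
    """
    tile, x, y = read_name.split(":")[-3:]
    # Map the position to a reproducible number in [0, 1).
    bucket = zlib.crc32(f"{tile}:{x}:{y}".encode()) / 2**32
    return bucket < fraction

name = "M01:7:FC1:1:1101:15589:1332"
first_call = keep_read(name, 0.5)
second_call = keep_read(name, 0.5)
```

Both ends of a pair share one read name, so they always receive the same decision, and repeated runs give identical results. As with the real tool, the fraction actually kept is only approximately FRACTION.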

ExtractSequences

Subsets intervals from a reference sequence to a new FASTA file. This tool takes a list of intervals, reads the corresponding subsequences from a reference FASTA file and writes them to a new FASTA file as separate records. Note that the reference FASTA file must be accompanied by an index file and the interval list must be provided in Picard list format. The names provided for the intervals will be used to name the corresponding records in the output file.

Usage example:

java -jar picard.jar ExtractSequences \
INTERVAL_LIST=regions_of_interest.interval_list \
R=reference.fasta \
O=extracted_IL_sequences.fasta

OptionDescription
INTERVAL_LIST (File)Interval list describing intervals to be extracted from the reference sequence. Required.
REFERENCE_SEQUENCE (File)Reference sequence FASTA file. Required.
OUTPUT (File)Output FASTA file. Required.
LINE_LENGTH (Integer)Maximum line length for sequence data. Default value: 80. This option can be set to 'null' to clear the default value.
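
As a rough illustration of the extraction step, the Python sketch below slices a subsequence using 1-based, inclusive coordinates (the convention used by Picard interval lists) and writes it as a FASTA record. The function names are hypothetical and strand handling is deliberately left out.

```python
def extract_interval(reference, start, end):
    """Return bases start..end of a sequence, using 1-based inclusive coordinates."""
    if not 1 <= start <= end <= len(reference):
        raise ValueError("interval lies outside the reference sequence")
    return reference[start - 1:end]

def to_fasta(name, seq, line_length=80):
    """Format one record, wrapping sequence lines at line_length bases."""
    lines = [seq[i:i + line_length] for i in range(0, len(seq), line_length)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"

record = to_fasta("target_1", extract_interval("ACGTACGT", 2, 4))
```

With the toy reference above, `record` is `">target_1\nCGT\n"`.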

QualityScoreDistribution

Chart the distribution of quality scores.

This tool is used for determining the overall 'quality' for a library in a given run. To that effect, it outputs a chart and tables indicating the range of quality scores and the total numbers of bases corresponding to those scores. Options include plotting the distribution of all of the reads, only the aligned reads, or reads that have passed the Illumina Chastity filter thresholds as described here.

Note on base quality score options

If the quality score of read bases has been modified in a previous data processing step such as GATK Base Recalibration and an OQ tag is available, this tool can be set to plot the OQ value as well as the primary quality value for the evaluation.

Note: Metrics labeled as percentages are actually expressed as fractions!

Usage Example:

java -jar picard.jar QualityScoreDistribution \
I=input.bam \
O=qual_score_dist.txt \
CHART=qual_score_dist.pdf

OptionDescription
CHART_OUTPUT (File)A file (with .pdf extension) to write the chart to. Required.
ALIGNED_READS_ONLY (Boolean)If set to true calculate mean quality over aligned reads only. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
PF_READS_ONLY (Boolean)If set to true calculate mean quality over PF reads only. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INCLUDE_NO_CALLS (Boolean)If set to true, include quality for no-call bases in the distribution. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INPUT (File)Input SAM or BAM file. Required.
OUTPUT (File)File to write the output to. Required.
ASSUME_SORTED (Boolean)If true (default), then the sort order in the header file will be ignored. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
STOP_AFTER (Long)Stop after processing N reads, mainly for debugging. Default value: 0. This option can be set to 'null' to clear the default value.
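
The core of such a distribution is a count of bases per quality score. The Python sketch below tallies scores from SAM-style quality strings, which encode each Phred score as an ASCII character with an offset of 33. It is a simplified illustration with a hypothetical function name, and it ignores the read filters described above.

```python
from collections import Counter

def quality_histogram(quality_strings, offset=33):
    """Count how many bases carry each Phred quality score."""
    counts = Counter()
    for qual in quality_strings:
        counts.update(ord(ch) - offset for ch in qual)
    return dict(sorted(counts.items()))

histogram = quality_histogram(["II#", "I5"])
```

For these two toy reads, `histogram` is `{2: 1, 20: 1, 40: 3}`.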

RenameSampleInVcf

Renames a sample within a VCF or BCF. This tool enables the user to rename a sample in either a VCF or BCF file. It is intended to change the name of a sample in a VCF prior to merging with VCF files in which one or more samples have similar names. Note that the input VCF file must be a single-sample VCF and that NEW_SAMPLE_NAME is required.

Usage example:

java -jar picard.jar RenameSampleInVcf \
I=input.vcf \
O=renamed.vcf \
NEW_SAMPLE_NAME=sample123

OptionDescription
INPUT (File)Input single sample VCF. Required.
OUTPUT (File)Output single sample VCF. Required.
OLD_SAMPLE_NAME (String)Existing name of the sample in the VCF; if provided, the tool asserts that the sample in the input file has this name. Default value: null.
NEW_SAMPLE_NAME (String)New name to give sample in output VCF. Required.

ReorderSam

Not to be confused with SortSam which sorts a SAM or BAM file with a valid sequence dictionary, ReorderSam reorders reads in a SAM/BAM file to match the contig ordering in a provided reference file, as determined by exact name matching of contigs. Reads mapped to contigs absent in the new reference are dropped. Runs substantially faster if the input is an indexed BAM file.

OptionDescription
INPUT (File)Input file (bam or sam) to extract reads from. Required.
OUTPUT (File)Output file (bam or sam) to write extracted reads to. Required.
REFERENCE (File)Reference sequence to reorder reads to match. A sequence dictionary corresponding to the reference fasta is required. Create one with CreateSequenceDictionary. Required.
ALLOW_INCOMPLETE_DICT_CONCORDANCE (Boolean)If true, then allows only a partial overlap of the BAM contigs with the new reference sequence contigs. By default, this tool requires a corresponding contig in the new reference for each read contig. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ALLOW_CONTIG_LENGTH_DISCORDANCE (Boolean)If true, then permits mapping from a read contig to a new reference contig with the same name but a different length. Highly dangerous, only use if you know what you are doing. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
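
The reordering can be pictured with the following Python sketch, which is illustrative only. Reads are hypothetical (contig, position, name) tuples; contigs are matched by exact name, and reads on contigs missing from the new reference are dropped, as described above.

```python
def reorder(reads, new_contig_order):
    """Group reads by contig in the order given by the new reference."""
    rank = {contig: i for i, contig in enumerate(new_contig_order)}
    kept = [read for read in reads if read[0] in rank]  # drop unknown contigs
    # sorted() is stable, so reads keep their original order within each contig.
    return sorted(kept, key=lambda read: rank[read[0]])

reads = [("chr2", 5, "a"), ("chr1", 9, "b"), ("chrM", 1, "c"), ("chr2", 7, "d")]
reordered = reorder(reads, ["chr1", "chr2"])
```

Note that this changes only the contig ordering; it does not sort reads by position within a contig, which is the job of SortSam.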

ReplaceSamHeader

Replaces the SAMFileHeader in a SAM or BAM file. This tool makes it possible to replace the header of a SAM or BAM file with the header of another file, or a header block that has been edited manually (in a stub SAM file). The sort order (the SO field of the @HD line) of the two input files must be the same.

Note that validation is minimal, so it is up to the user to ensure that all the elements referred to in the SAMRecords are present in the new header.

Usage example:

java -jar picard.jar ReplaceSamHeader \
I=input_1.bam \
HEADER=input_2.bam \
O=bam_with_new_head.bam

OptionDescription
INPUT (File)SAM file from which SAMRecords will be read. Required.
HEADER (File)SAM file from which SAMFileHeader will be read. Required.
OUTPUT (File)SAMFileHeader from HEADER file will be written to this file, followed by SAMRecords from INPUT file Required.

RevertSam

Reverts SAM or BAM files to a previous state. This tool removes or restores certain properties of the SAM records, including alignment information, which can be used to produce an unmapped BAM (uBAM) from a previously aligned BAM. It is also capable of restoring the original quality scores of a BAM file that has already undergone base quality score recalibration (BQSR) if the original qualities were retained.

Example with single output:

java -jar picard.jar RevertSam \
I=input.bam \
O=reverted.bam
Output format is BAM by default, or SAM or CRAM if the input path ends with '.sam' or '.cram', respectively.

Example outputting by read group with output map:

java -jar picard.jar RevertSam \
I=input.bam \
OUTPUT_BY_READGROUP=true \
OUTPUT_MAP=reverted_bam_paths.tsv
Will output a BAM/SAM file per read group. By default, all outputs will be in BAM format. However, a SAM file will be produced instead for any read group mapped in OUTPUT_MAP to a path ending with '.sam'. A CRAM file will be produced for any read group mapped to a path ending with '.cram'.

Example outputting by read group without output map:

java -jar picard.jar RevertSam \
I=input.bam \
OUTPUT_BY_READGROUP=true \
O=/write/reverted/read/group/bams/in/this/dir
Will output a BAM/SAM file per read group. By default, all outputs will be in BAM format. However, outputs will be in SAM format if the input path ends with '.sam', or CRAM format if it ends with '.cram'. This behaviour can be overridden with the OUTPUT_BY_READGROUP_FILE_FORMAT option.

Note: If the program fails due to a SAM validation error, consider setting the VALIDATION_STRINGENCY option to LENIENT or SILENT if the failures are expected to be obviated by the reversion process (e.g. invalid alignment information will be obviated when the REMOVE_ALIGNMENT_INFORMATION option is used).


OptionDescription
INPUT (File)The input SAM/BAM file to revert the state of. Required.
OUTPUT (File)The output SAM/BAM file to create, or an output directory if OUTPUT_BY_READGROUP is true. Required. Cannot be used in conjunction with option(s) OUTPUT_MAP (OM)
OUTPUT_MAP (File)Tab-separated file with two columns, READ_GROUP_ID and OUTPUT, providing a file mapping; only used if OUTPUT_BY_READGROUP is true. Required. Cannot be used in conjunction with option(s) OUTPUT (O)
OUTPUT_BY_READGROUP (Boolean)When true, outputs each read group in a separate file. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
OUTPUT_BY_READGROUP_FILE_FORMAT (FileType)When using OUTPUT_BY_READGROUP, the output file format can be set to a certain format. Default value: dynamic. This option can be set to 'null' to clear the default value. Possible values: {sam, bam, cram, dynamic}
SORT_ORDER (SortOrder)The sort order to create the reverted output file with. Default value: queryname. This option can be set to 'null' to clear the default value. Possible values: {unsorted, queryname, coordinate, duplicate, unknown}
RESTORE_ORIGINAL_QUALITIES (Boolean)True to restore original qualities from the OQ field to the QUAL field if available. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
REMOVE_DUPLICATE_INFORMATION (Boolean)Remove duplicate read flags from all reads. Note that if this is true and REMOVE_ALIGNMENT_INFORMATION==false, the output may have the unusual but sometimes desirable trait of having unmapped reads that are marked as duplicates. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
REMOVE_ALIGNMENT_INFORMATION (Boolean)Remove all alignment information from the file. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ATTRIBUTE_TO_CLEAR (String)When removing alignment information, the set of optional tags to remove. Default value: [NM, UQ, PG, MD, MQ, SA, MC, AS]. This option can be set to 'null' to clear the default value. This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
SANITIZE (Boolean)WARNING: This option is potentially destructive. If enabled will discard reads in order to produce a consistent output BAM. Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities. This option can only be enabled if the output sort order is queryname and will always cause sorting to occur. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MAX_DISCARD_FRACTION (Double)If SANITIZE=true and more than MAX_DISCARD_FRACTION of the reads are discarded due to sanitization, then the program will exit with an Exception instead of exiting cleanly. Output BAM will still be valid. Default value: 0.01. This option can be set to 'null' to clear the default value.
SAMPLE_ALIAS (String)The sample alias to use in the reverted output file. This will override the existing sample alias in the file and is used only if all the read groups in the input file have the same sample alias. Default value: null.
LIBRARY_NAME (String)The library name to use in the reverted output file. This will override the existing library name in the file and is used only if all the read groups in the input file have the same library name. Default value: null.
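
For the OUTPUT_MAP mode, the mapping file can be generated with a short script. The Python sketch below builds the two-column, tab-separated content; it assumes that the file starts with a header row naming the READ_GROUP_ID and OUTPUT columns, and the read group IDs and output directory are made-up examples.

```python
import csv
import io

def build_output_map(read_group_ids, out_dir, extension=".bam"):
    """Return tab-separated text mapping each read group ID to an output path."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["READ_GROUP_ID", "OUTPUT"])  # assumed header row
    for rg_id in read_group_ids:
        # The extension of each path selects the format for that read group.
        writer.writerow([rg_id, f"{out_dir}/{rg_id}{extension}"])
    return buffer.getvalue()

output_map = build_output_map(["rg1", "rg2"], "/tmp/reverted")
```

Writing `output_map` to a file such as reverted_bam_paths.tsv gives something that can be passed as OUTPUT_MAP; using '.sam' or '.cram' as the extension for a read group would request that format for its output.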

RevertOriginalBaseQualitiesAndAddMateCigar

Reverts the original base qualities and adds the mate cigar tag to read-group BAMs.

OptionDescription
INPUT (File)The input SAM/BAM file to revert the state of. Required.
OUTPUT (File)The output SAM/BAM file to create. Required.
SORT_ORDER (SortOrder)The sort order to create the reverted output file with. By default, the sort order will be the same as the input. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate, unknown}
RESTORE_ORIGINAL_QUALITIES (Boolean)True to restore original qualities from the OQ field to the QUAL field if available. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MAX_RECORDS_TO_EXAMINE (Integer)The maximum number of records to examine to determine if we can exit early and not output, given that there are no original base qualities (if we are to restore) and mate cigars exist. Set to 0 to never skip the file. Default value: 10000. This option can be set to 'null' to clear the default value.

SamFormatConverter

Convert a BAM file to a SAM file, or SAM to BAM. Input and output formats are determined by file extension.

OptionDescription
INPUT (File)The BAM or SAM file to parse. Required.
OUTPUT (File)The BAM or SAM output file. Required.

SamToFastq

Converts a SAM or BAM file to FASTQ. This tool extracts read sequences and base quality scores from the input SAM/BAM file and outputs them in FASTQ format. This can be used by way of a pipe to run BWA MEM on unmapped BAM (uBAM) files efficiently.

Usage example:

java -jar picard.jar SamToFastq \
I=input.bam \
FASTQ=output.fastq

OptionDescription
INPUT (File)Input SAM/BAM file to extract reads from. Required.
FASTQ (File)Output FASTQ file (single-end fastq or, if paired, first end of the pair FASTQ). Required. Cannot be used in conjunction with option(s) OUTPUT_PER_RG (OPRG)
SECOND_END_FASTQ (File)Output FASTQ file (if paired, second end of the pair FASTQ). Default value: null. Cannot be used in conjunction with option(s) OUTPUT_PER_RG (OPRG)
UNPAIRED_FASTQ (File)Output FASTQ file for unpaired reads; may only be provided in paired-FASTQ mode. Default value: null. Cannot be used in conjunction with option(s) OUTPUT_PER_RG (OPRG)
OUTPUT_PER_RG (Boolean)Output a FASTQ file per read group (two FASTQ files per read group if the group is paired). Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false} Cannot be used in conjunction with option(s) SECOND_END_FASTQ (F2) UNPAIRED_FASTQ (FU) FASTQ (F)
RG_TAG (String)The read group tag (PU or ID) to be used to output a FASTQ file per read group. Default value: PU. This option can be set to 'null' to clear the default value.
OUTPUT_DIR (File)Directory in which to output the FASTQ file(s). Used only when OUTPUT_PER_RG is true. Default value: null.
RE_REVERSE (Boolean)Re-reverse bases and qualities of reads with negative strand flag set before writing them to FASTQ. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INTERLEAVE (Boolean)Will generate an interleaved fastq if paired; each read name will have /1 or /2 appended to describe which end it came from. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INCLUDE_NON_PF_READS (Boolean)Include non-PF reads from the SAM file into the output FASTQ files. PF means 'passes filtering'. Reads whose 'not passing quality controls' flag is set are non-PF reads. See GATK Dictionary for more info. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
CLIPPING_ATTRIBUTE (String)The attribute that stores the position at which the SAM record should be clipped. Default value: null.
CLIPPING_ACTION (String)The action that should be taken with clipped reads: 'X' means the reads and qualities should be trimmed at the clipped position; 'N' means the bases should be changed to Ns in the clipped region; and any integer means that the base qualities should be set to that value in the clipped region. Default value: null.
CLIPPING_MIN_LENGTH (Integer)When performing clipping with the CLIPPING_ATTRIBUTE and CLIPPING_ACTION parameters, ensure that the resulting reads after clipping are at least CLIPPING_MIN_LENGTH bases long. If the original read is shorter than CLIPPING_MIN_LENGTH then the original read length will be maintained. Default value: 0. This option can be set to 'null' to clear the default value.
READ1_TRIM (Integer)The number of bases to trim from the beginning of read 1. Default value: 0. This option can be set to 'null' to clear the default value.
READ1_MAX_BASES_TO_WRITE (Integer)The maximum number of bases to write from read 1 after trimming. If there are fewer than this many bases left after trimming, all will be written. If this value is null then all bases left after trimming will be written. Default value: null.
READ2_TRIM (Integer)The number of bases to trim from the beginning of read 2. Default value: 0. This option can be set to 'null' to clear the default value.
READ2_MAX_BASES_TO_WRITE (Integer)The maximum number of bases to write from read 2 after trimming. If there are fewer than this many bases left after trimming, all will be written. If this value is null then all bases left after trimming will be written. Default value: null.
QUALITY (Integer)End-trim reads using the phred/bwa quality trimming algorithm and this quality. Default value: null.
INCLUDE_NON_PRIMARY_ALIGNMENTS (Boolean)If true, include non-primary alignments in the output. Support of non-primary alignments in SamToFastq is not comprehensive, so there may be exceptions if this is set to true and there are paired reads with non-primary alignments. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
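
To make the RE_REVERSE behaviour concrete: a read aligned to the negative strand is stored in a SAM/BAM file as its reverse complement, so restoring the sequence as it came off the sequencer means reverse-complementing the bases and reversing the qualities. The Python sketch below shows that step together with basic FASTQ formatting; the function name and its arguments are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def to_fastq(name, bases, quals, reverse_strand=False, re_reverse=True):
    """Format one read as a four-line FASTQ record."""
    if reverse_strand and re_reverse:
        bases = bases.translate(_COMPLEMENT)[::-1]  # reverse complement
        quals = quals[::-1]                         # keep qualities aligned with bases
    return f"@{name}\n{bases}\n+\n{quals}\n"

record = to_fastq("r1", "AACG", "IIH#", reverse_strand=True)
```

Here `record` is `"@r1\nCGTT\n+\n#HII\n"`; with `re_reverse=False` the stored bases and qualities would be written unchanged.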

ScatterIntervalsByNs

Writes an interval list based on splitting a reference by Ns. This tool identifies positions in a reference where the bases are 'no-calls' and writes out an interval list using the resulting coordinates. This can be used to create an interval list for whole genome sequence (WGS) for e.g. scatter-gather purposes, as an alternative to using fixed-length intervals. The number of contiguous no-calls that can be tolerated before creating a break is adjustable from the command line.

Usage example:

java -jar picard.jar ScatterIntervalsByNs \
R=reference_sequence.fasta \
OT=BOTH \
O=output.interval_list

OptionDescription
REFERENCE (File)Reference sequence to use. Note: this tool requires that the reference fasta has both an associated index and a dictionary. Required.
OUTPUT (File)Output file for interval list. Required.
OUTPUT_TYPE (OutputType)Type of intervals to output. Default value: BOTH. This option can be set to 'null' to clear the default value. Possible values: {N, ACGT, BOTH}
MAX_TO_MERGE (Integer)Maximal number of contiguous N bases to tolerate, thereby continuing the current ACGT interval. Default value: 1. This option can be set to 'null' to clear the default value.
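
The splitting logic can be sketched as follows in Python. This is an illustration under stated assumptions rather than the tool's code: intervals are reported as 1-based inclusive (start, end, type) tuples, and an N run is absorbed into the surrounding ACGT interval only if it is at most `max_to_merge` bases long and has ACGT sequence on both sides.

```python
import re

def scatter_by_ns(seq, max_to_merge=1):
    """Split a sequence into 'ACGT' and 'N' intervals (1-based, inclusive)."""
    # First pass: maximal runs of N bases and of non-N bases.
    runs = [(m.start() + 1, m.end(), "N" if m.group()[0] in "Nn" else "ACGT")
            for m in re.finditer(r"[Nn]+|[^Nn]+", seq)]
    merged = []
    for i, (start, end, kind) in enumerate(runs):
        short_n = kind == "N" and end - start + 1 <= max_to_merge
        interior = 0 < i < len(runs) - 1
        if short_n and interior:
            kind = "ACGT"  # tolerated: continue the current ACGT interval
        if merged and merged[-1][2] == kind:
            merged[-1] = (merged[-1][0], end, kind)  # extend previous interval
        else:
            merged.append((start, end, kind))
    return merged

intervals = scatter_by_ns("ACGTNACGNNNNAC", max_to_merge=1)
```

For this toy sequence, `intervals` is `[(1, 8, "ACGT"), (9, 12, "N"), (13, 14, "ACGT")]`: the single N at position 5 is tolerated, while the run of four Ns creates a break.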

SetNmMdAndUqTags

Fixes the NM, MD, and UQ tags in a SAM file. This tool takes in a SAM or BAM file (sorted by coordinate) and calculates the NM, MD, and UQ tags by comparing with the reference.
This may be needed when MergeBamAlignment was run with SORT_ORDER different from 'coordinate' and thus could not fix these tags then.

Usage example:

java -jar picard.jar SetNmMdAndUqTags \
I=sorted.bam \
O=fixed.bam

OptionDescription
INPUT (File)The BAM or SAM file to fix. Required.
OUTPUT (File)The fixed BAM or SAM output file. Required.
IS_BISULFITE_SEQUENCE (Boolean)Whether the file contains bisulfite sequence (used when calculating the NM tag). Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

SortSam

Sorts a SAM or BAM file. This tool sorts the input SAM or BAM file by coordinate, queryname (QNAME), or some other property of the SAM record. The SortOrder of a SAM/BAM file is found in the SAM file header tag @HD in the field labeled SO.

For a coordinate sorted SAM/BAM file, read alignments are sorted first by the reference sequence name (RNAME) field using the reference sequence dictionary (@SQ tag). Alignments within these subgroups are secondarily sorted using the left-most mapping position of the read (POS). Subsequent to this sorting scheme, alignments are listed arbitrarily.

For queryname-sorted alignments, all alignments are grouped using the queryname field but the alignments are not necessarily sorted within these groups. Reads having the same queryname are derived from the same template.

Usage example:

java -jar picard.jar SortSam \
I=input.bam \
O=sorted.bam \
SORT_ORDER=coordinate

OptionDescription
INPUT (File)The BAM or SAM file to sort. Required.
OUTPUT (File)The sorted BAM or SAM output file. Required.
SORT_ORDER (SortOrder)Sort order of output file. Required. Possible values: {unsorted, queryname, coordinate, duplicate, unknown}
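
The coordinate ordering described above amounts to sorting on a two-part key: the rank of the reference name in the sequence dictionary, then the leftmost mapping position. The Python sketch below illustrates this with hypothetical (RNAME, POS, QNAME) tuples; reads without a reference (RNAME '*') are placed last.

```python
def coordinate_sort(records, sq_order):
    """Sort records by @SQ order of RNAME, then by POS."""
    rank = {name: i for i, name in enumerate(sq_order)}
    unmapped_rank = len(sq_order)  # anything not in the dictionary sorts last
    return sorted(records, key=lambda rec: (rank.get(rec[0], unmapped_rank), rec[1]))

records = [("chr2", 10, "a"), ("*", 0, "u"), ("chr1", 300, "b"), ("chr1", 20, "c")]
by_coordinate = coordinate_sort(records, ["chr1", "chr2"])
```

Note that the order of contigs comes from the sequence dictionary, not from alphabetical order of the names.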

SortVcf

Sorts one or more VCF files. This tool sorts the records in VCF files according to the order of the contigs in the header/sequence dictionary and then by coordinate. It can accept an external sequence dictionary. If no external dictionary is supplied, the VCF file headers of multiple inputs must have the same sequence dictionaries.

If running on multiple inputs (originating from e.g. some scatter-gather runs), the input files must contain the same sample names in the same column order.

Usage example:

java -jar picard.jar SortVcf \
I=vcf_1.vcf \
I=vcf_2.vcf \
O=sorted.vcf

OptionDescription
INPUT (File)Input VCF(s) to be sorted. Multiple inputs must have the same sample names (in order). Default value: null. This option may be specified 0 or more times.
OUTPUT (File)Output VCF to be written. Required.
SEQUENCE_DICTIONARY (File)Default value: null.

SplitSamByLibrary

Takes a SAM or BAM file and separates all the reads into one SAM or BAM file per library name. Reads that do not have a read group specified or whose read group does not have a library name are written to a file called 'unknown.' The format (SAM or BAM) of the output files matches that of the input file.

OptionDescription
INPUT (File)The SAM or BAM file to be split. Required.
OUTPUT (File)The directory where the library SAM or BAM files should be written (defaults to the current directory). Default value: the current working directory. This option can be set to 'null' to clear the default value.
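
The grouping rule can be sketched in Python as follows. It is a simplified illustration: reads are hypothetical (read name, read group ID) pairs, and `rg_to_library` stands in for the library names recorded for each read group in the file header.

```python
from collections import defaultdict

def split_by_library(reads, rg_to_library):
    """Group read names by library; reads without a library go to 'unknown'."""
    groups = defaultdict(list)
    for name, read_group in reads:
        # Covers a missing read group as well as a read group without a library.
        library = rg_to_library.get(read_group) or "unknown"
        groups[library].append(name)
    return dict(groups)

reads = [("r1", "rgA"), ("r2", None), ("r3", "rgB"), ("r4", "rgC")]
libraries = split_by_library(reads, {"rgA": "lib1", "rgB": "lib1", "rgC": None})
```

Here reads r1 and r3 end up under "lib1", while r2 (no read group) and r4 (read group without a library name) end up under "unknown".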

UmiAwareMarkDuplicatesWithMateCigar

Identifies duplicate reads using information from read positions and UMIs.

This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA. It is based on the MarkDuplicatesWithMateCigar tool, with added logic to leverage Unique Molecular Identifier (UMI) information.

In addition to assuming that all members of a duplicate set must have the same start and end position, it imposes that they must also have sufficiently similar UMIs. In this context, 'sufficiently similar' is parameterized by the command line argument MAX_EDIT_DISTANCE_TO_JOIN, which sets the edit distance between UMIs that will be considered to be part of the same original molecule. This logic allows for sequencing errors in UMIs.

This tool is NOT intended to be used on data without UMIs; for marking duplicates in non-UMI data, see MarkDuplicates or MarkDuplicatesWithMateCigar. Mixed data (where some reads have UMIs and others do not) is not supported.

OptionDescription
MAX_EDIT_DISTANCE_TO_JOIN (Integer)Largest edit distance at which two UMIs are still considered to come from the same source molecule. Default value: 1. This option can be set to 'null' to clear the default value.
UMI_METRICS_FILE (File)UMI metrics file. Required.
UMI_TAG_NAME (String)Tag name to use for UMI. Default value: RX. This option can be set to 'null' to clear the default value.
ASSIGNED_UMI_TAG (String)Tag name to use for assigned UMI. Default value: MI. This option can be set to 'null' to clear the default value.
ALLOW_MISSING_UMIS (Boolean)FOR TESTING ONLY: allow for missing UMIs if data doesn't have UMIs. This option is intended to be used ONLY for testing the code. Use MarkDuplicatesWithMateCigar if data has no UMIs. Mixed data (where some reads have UMIs and others do not) is not supported. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP (Integer)This option is obsolete. ReadEnds will always be spilled to disk. Default value: 50000. This option can be set to 'null' to clear the default value.
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP (Integer)Maximum number of file handles to keep open when spilling read ends to disk. Set this number a little lower than the per-process maximum number of files that may be open. This number can be found by executing the 'ulimit -n' command on a Unix system. Default value: 8000. This option can be set to 'null' to clear the default value.
SORTING_COLLECTION_SIZE_RATIO (Double)This number, together with the maximum RAM available to the JVM, determines the memory footprint used by some of the sorting collections. If you are running out of memory, try reducing this number. Default value: 0.25. This option can be set to 'null' to clear the default value.
BARCODE_TAG (String)Barcode SAM tag (ex. BC for 10X Genomics) Default value: null.
READ_ONE_BARCODE_TAG (String)Read one barcode SAM tag (ex. BX for 10X Genomics) Default value: null.
READ_TWO_BARCODE_TAG (String)Read two barcode SAM tag (ex. BX for 10X Genomics) Default value: null.
TAG_DUPLICATE_SET_MEMBERS (Boolean)If a read appears in a duplicate set, add two tags. The first tag, DUPLICATE_SET_SIZE_TAG (DS), indicates the size of the duplicate set. The smallest possible DS value is 2 which occurs when two reads map to the same portion of the reference only one of which is marked as duplicate. The second tag, DUPLICATE_SET_INDEX_TAG (DI), represents a unique identifier for the duplicate set to which the record belongs. This identifier is the index-in-file of the representative read that was selected out of the duplicate set. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
REMOVE_SEQUENCING_DUPLICATES (Boolean)If true remove 'optical' duplicates and other duplicates that appear to have arisen from the sequencing process instead of the library preparation process, even if REMOVE_DUPLICATES is false. If REMOVE_DUPLICATES is true, all duplicates are removed and this option is ignored. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
TAGGING_POLICY (DuplicateTaggingPolicy)Determines how duplicate types are recorded in the DT optional attribute. Default value: DontTag. This option can be set to 'null' to clear the default value. Possible values: {DontTag, OpticalOnly, All}
INPUT (String)One or more input SAM or BAM files to analyze. Must be coordinate sorted. Default value: null. This option may be specified 0 or more times.
OUTPUT (File)The output file to write marked records to Required.
METRICS_FILE (File)File to write duplication metrics to Required.
REMOVE_DUPLICATES (Boolean)If true do not write duplicates to the output file instead of writing them with appropriate flags set. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
ASSUME_SORTED (Boolean)If true, assume that the input file is coordinate sorted even if the header says otherwise. Deprecated, use ASSUME_SORT_ORDER=coordinate instead. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false} Cannot be used in conjunction with option(s) ASSUME_SORT_ORDER (ASO)
ASSUME_SORT_ORDER (SortOrder)If not null, assume that the input file has this order even if the header says otherwise. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate, unknown} Cannot be used in conjunction with option(s) ASSUME_SORTED (AS)
DUPLICATE_SCORING_STRATEGY (ScoringStrategy)The scoring strategy for choosing the non-duplicate among candidates. Default value: SUM_OF_BASE_QUALITIES. This option can be set to 'null' to clear the default value. Possible values: {SUM_OF_BASE_QUALITIES, TOTAL_MAPPED_REFERENCE_LENGTH, RANDOM}
PROGRAM_RECORD_ID (String)The program record ID for the @PG record(s) created by this program. Set to null to disable PG record creation. This string may have a suffix appended to avoid collision with other program record IDs. Default value: MarkDuplicates. This option can be set to 'null' to clear the default value.
PROGRAM_GROUP_VERSION (String)Value of VN tag of PG record to be created. If not specified, the version will be detected automatically. Default value: null.
PROGRAM_GROUP_COMMAND_LINE (String)Value of CL tag of PG record to be created. If not supplied the command line will be detected automatically. Default value: null.
PROGRAM_GROUP_NAME (String)Value of PN tag of PG record to be created. Default value: UmiAwareMarkDuplicatesWithMateCigar. This option can be set to 'null' to clear the default value.
COMMENT (String)Comment(s) to include in the output file's header. Default value: null. This option may be specified 0 or more times.
READ_NAME_REGEX (String)Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate counts, library size estimation will be inaccurate. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values. Default value: . This option can be set to 'null' to clear the default value.
OPTICAL_DUPLICATE_PIXEL_DISTANCE (Integer)The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best. Default value: 100. This option can be set to 'null' to clear the default value.
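
Usage example (a minimal sketch, not an official recommendation). It supplies the options marked Required above and derives MAX_FILE_HANDLES_FOR_READ_ENDS_MAP from 'ulimit -n'. The file names, the jar location in PICARD_JAR, the safety margin of 100 handles, and the fallback to the documented default of 8000 are all assumptions:

```shell
# Assumption: picard.jar is in the current directory unless PICARD_JAR is set.
PICARD_JAR="${PICARD_JAR:-picard.jar}"

# Choose a handle count a little below the per-process limit.
limit=$(ulimit -n)
case "$limit" in
    ''|*[!0-9]*)
        handles=8000 ;;                      # non-numeric, e.g. "unlimited"
    *)
        if [ "$limit" -gt 100 ]; then
            handles=$((limit - 100))         # assumed margin of 100
        else
            handles=1
        fi ;;
esac

# input.bam and the three output names are hypothetical.
set -- java -Xmx2g -jar "$PICARD_JAR" UmiAwareMarkDuplicatesWithMateCigar \
    INPUT=input.bam \
    OUTPUT=marked.bam \
    METRICS_FILE=duplicate_metrics.txt \
    UMI_METRICS_FILE=umi_metrics.txt \
    MAX_EDIT_DISTANCE_TO_JOIN=1 \
    MAX_FILE_HANDLES_FOR_READ_ENDS_MAP="$handles"

echo "$*"                                    # show the command that would be run
if [ -f "$PICARD_JAR" ] && [ -f input.bam ]; then
    "$@"
fi
```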

UpdateVcfSequenceDictionary

Takes a VCF and a second file that contains a sequence dictionary and updates the VCF with the new sequence dictionary.

OptionDescription
INPUT (File)Input VCF Required.
OUTPUT (File)Output VCF to be written. Required.
SEQUENCE_DICTIONARY (File)A sequence dictionary, which can be read from one of the following file types: SAM, BAM, VCF, BCF, Interval List, Fasta, or Dict. Required.
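
Usage example (a minimal sketch: input.vcf, updated.vcf and reference.dict are hypothetical names, and the jar location in PICARD_JAR is an assumption):

```shell
# Assumption: picard.jar is in the current directory unless PICARD_JAR is set.
PICARD_JAR="${PICARD_JAR:-picard.jar}"

set -- java -Xmx2g -jar "$PICARD_JAR" UpdateVcfSequenceDictionary \
    INPUT=input.vcf \
    OUTPUT=updated.vcf \
    SEQUENCE_DICTIONARY=reference.dict

echo "$*"                      # show the command that would be run
if [ -f "$PICARD_JAR" ] && [ -f input.vcf ] && [ -f reference.dict ]; then
    "$@"
fi
```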

VcfFormatConverter

Converts VCF to BCF or BCF to VCF. This tool converts files between the plain-text VCF format and its binary compressed equivalent, BCF. Input and output formats are determined by file extensions specified in the file names. For best results, it is recommended to ensure that an index file is present and set the REQUIRE_INDEX option to true.

Usage example:

java -jar picard.jar VcfFormatConverter \
I=input.vcf \
O=output.bcf \
REQUIRE_INDEX=true

OptionDescription
INPUT (File)The BCF or VCF input file. Required.
OUTPUT (File)The BCF or VCF output file name. Required.
REQUIRE_INDEX (Boolean)Fail if an index is not available for the input VCF/BCF Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}

MarkIlluminaAdapters

Reads a SAM or BAM file and rewrites it with new adapter-trimming tags.

This tool clears any existing adapter-trimming tags (XT:i:) in the optional tag region of a SAM file. The SAM/BAM file must be sorted by query name.

Outputs a metrics file histogram showing counts of bases_clipped per read.

Usage example:

java -jar picard.jar MarkIlluminaAdapters \
INPUT=input.sam \
METRICS=metrics.txt

OptionDescription
INPUT (File)Required.
OUTPUT (File)If output is not specified, just the metrics are generated Default value: null.
METRICS (File)Histogram showing counts of bases_clipped in how many reads Required.
MIN_MATCH_BASES_SE (Integer)The minimum number of bases to match over when clipping single-end reads. Default value: 12. This option can be set to 'null' to clear the default value.
MIN_MATCH_BASES_PE (Integer)The minimum number of bases to match over (per-read) when clipping paired-end reads. Default value: 6. This option can be set to 'null' to clear the default value.
MAX_ERROR_RATE_SE (Double)The maximum mismatch error rate to tolerate when clipping single-end reads. Default value: 0.1. This option can be set to 'null' to clear the default value.
MAX_ERROR_RATE_PE (Double)The maximum mismatch error rate to tolerate when clipping paired-end reads. Default value: 0.1. This option can be set to 'null' to clear the default value.
PAIRED_RUN (Boolean)DEPRECATED. Whether this is a paired-end run. No longer used. Default value: null. Possible values: {true, false}
ADAPTERS (IlluminaAdapterPair)Which adapter sequences to attempt to identify and clip. Default value: [INDEXED, DUAL_INDEXED, PAIRED_END]. Possible values: {PAIRED_END, INDEXED, SINGLE_END, NEXTERA_V1, NEXTERA_V2, DUAL_INDEXED, FLUIDIGM, TRUSEQ_SMALLRNA, ALTERNATIVE_SINGLE_END} This option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
FIVE_PRIME_ADAPTER (String)For specifying adapters other than standard Illumina Default value: null.
THREE_PRIME_ADAPTER (String)For specifying adapters other than standard Illumina Default value: null.
ADAPTER_TRUNCATION_LENGTH (Integer)Adapters are truncated to this length to speed adapter matching. Set to a large number to effectively disable truncation. Default value: 30. This option can be set to 'null' to clear the default value.
PRUNE_ADAPTER_LIST_AFTER_THIS_MANY_ADAPTERS_SEEN (Integer)If looking for multiple adapter sequences, then after having seen this many adapters, shorten the list of sequences. Keep the adapters that were found most frequently in the input so far. Set to -1 if the input has a heterogeneous mix of adapters so shortening is undesirable. Default value: 100. This option can be set to 'null' to clear the default value.
NUM_ADAPTERS_TO_KEEP (Integer)If pruning the adapter list, keep only this many adapter sequences when pruning the list (plus any adapters that were tied with the adapters being kept). Default value: 1. This option can be set to 'null' to clear the default value.

SplitVcfs

Splits SNPs and INDELs into separate files. This tool reads in a VCF or BCF file and writes out the SNPs and INDELs it contains to separate files. The headers of the two output files will be identical and index files will be created for both outputs. If records other than SNPs or INDELs are present, set the STRICT option to "false", otherwise the tool will raise an exception and quit.

Usage example:

java -jar picard.jar SplitVcfs \
I=input.vcf \
SNP_OUTPUT=snp.vcf \
INDEL_OUTPUT=indel.vcf \
STRICT=false

OptionDescription
INPUT (File)The VCF or BCF input file Required.
SNP_OUTPUT (File)The VCF or BCF file to which SNP records should be written. The file format is determined by file extension. Required.
INDEL_OUTPUT (File)The VCF or BCF file to which indel records should be written. The file format is determined by file extension. Required.
SEQUENCE_DICTIONARY (File)The index sequence dictionary to use instead of the sequence dictionaries in the input files Default value: null.
STRICT (Boolean)If true an exception will be thrown if an event type other than SNP or indel is encountered Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}

ValidateSamFile

Validates a SAM or BAM file.

This tool reports on the validity of a SAM or BAM file relative to the SAM format specification. This is useful for troubleshooting errors encountered with other tools that may be caused by improper formatting, faulty alignments, incorrect flag values, etc.

By default, the tool runs in VERBOSE mode: it prints the errors it finds to the console (stdout) and exits after finding 100 errors. Therefore, it is often more practical to run this tool initially using the MODE=SUMMARY option. This mode outputs a summary table listing the numbers of all 'errors' and 'warnings'.

When fixing errors in your file, it is often useful to prioritize the severe validation errors and ignore the errors/warnings of lesser concern. This can be done using the IGNORE and/or IGNORE_WARNINGS arguments. For helpful suggestions on error prioritization, see the additional documentation on ValidateSamFile.

After identifying and fixing your 'warnings/errors', we recommend that you rerun this tool to validate your SAM/BAM file prior to proceeding with your downstream analysis. This will verify that all problems in your file have been addressed.

Usage example:

java -jar picard.jar ValidateSamFile \
I=input.bam \
MODE=SUMMARY

To obtain a complete list with descriptions of both 'ERROR' and 'WARNING' messages, please see our additional documentation for this tool.


OptionDescription
INPUT (File)Input SAM/BAM file Required.
OUTPUT (File)Output file or standard out if missing Default value: null.
MODE (Mode)Mode of output Default value: VERBOSE. This option can be set to 'null' to clear the default value. Possible values: {VERBOSE, SUMMARY}
IGNORE (Type)List of validation error types to ignore. Default value: null. Possible values: {INVALID_QUALITY_FORMAT, INVALID_FLAG_PROPER_PAIR, INVALID_FLAG_MATE_UNMAPPED, MISMATCH_FLAG_MATE_UNMAPPED, INVALID_FLAG_MATE_NEG_STRAND, MISMATCH_FLAG_MATE_NEG_STRAND, INVALID_FLAG_FIRST_OF_PAIR, INVALID_FLAG_SECOND_OF_PAIR, PAIRED_READ_NOT_MARKED_AS_FIRST_OR_SECOND, INVALID_FLAG_NOT_PRIM_ALIGNMENT, INVALID_FLAG_SUPPLEMENTARY_ALIGNMENT, INVALID_FLAG_READ_UNMAPPED, INVALID_INSERT_SIZE, INVALID_MAPPING_QUALITY, INVALID_CIGAR, ADJACENT_INDEL_IN_CIGAR, INVALID_MATE_REF_INDEX, MISMATCH_MATE_REF_INDEX, INVALID_REFERENCE_INDEX, INVALID_ALIGNMENT_START, MISMATCH_MATE_ALIGNMENT_START, MATE_FIELD_MISMATCH, INVALID_TAG_NM, MISSING_TAG_NM, MISSING_HEADER, MISSING_SEQUENCE_DICTIONARY, MISSING_READ_GROUP, RECORD_OUT_OF_ORDER, READ_GROUP_NOT_FOUND, RECORD_MISSING_READ_GROUP, INVALID_INDEXING_BIN, MISSING_VERSION_NUMBER, INVALID_VERSION_NUMBER, TRUNCATED_FILE, MISMATCH_READ_LENGTH_AND_QUALS_LENGTH, EMPTY_READ, CIGAR_MAPS_OFF_REFERENCE, MISMATCH_READ_LENGTH_AND_E2_LENGTH, MISMATCH_READ_LENGTH_AND_U2_LENGTH, E2_BASE_EQUALS_PRIMARY_BASE, BAM_FILE_MISSING_TERMINATOR_BLOCK, UNRECOGNIZED_HEADER_TYPE, POORLY_FORMATTED_HEADER_TAG, HEADER_TAG_MULTIPLY_DEFINED, HEADER_RECORD_MISSING_REQUIRED_TAG, HEADER_TAG_NON_CONFORMING_VALUE, INVALID_DATE_STRING, TAG_VALUE_TOO_LARGE, INVALID_INDEX_FILE_POINTER, INVALID_PREDICTED_MEDIAN_INSERT_SIZE, DUPLICATE_READ_GROUP_ID, MISSING_PLATFORM_VALUE, INVALID_PLATFORM_VALUE, DUPLICATE_PROGRAM_GROUP_ID, MATE_NOT_FOUND, MATES_ARE_SAME_END, MISMATCH_MATE_CIGAR_STRING, MATE_CIGAR_STRING_INVALID_PRESENCE} This option may be specified 0 or more times.
MAX_OUTPUT (Integer)The maximum number of lines output in verbose mode Default value: 100. This option can be set to 'null' to clear the default value.
IGNORE_WARNINGS (Boolean)If true, only report errors and ignore warnings. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
VALIDATE_INDEX (Boolean)DEPRECATED. Use INDEX_VALIDATION_STRINGENCY instead. If true and input is a BAM file with an index file, also validates the index. Until this parameter is retired, VALIDATE_INDEX and INDEX_VALIDATION_STRINGENCY must agree on whether to validate the index. Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INDEX_VALIDATION_STRINGENCY (IndexValidationStringency)If set to anything other than IndexValidationStringency.NONE and input is a BAM file with an index file, also validates the index at the specified stringency. Until VALIDATE_INDEX is retired, VALIDATE_INDEX and INDEX_VALIDATION_STRINGENCY must agree on whether to validate the index. Default value: EXHAUSTIVE. This option can be set to 'null' to clear the default value. Possible values: {EXHAUSTIVE, LESS_EXHAUSTIVE, NONE}
IS_BISULFITE_SEQUENCED (Boolean)Whether the SAM or BAM file consists of bisulfite sequenced reads. If so, C->T is not counted as an error in computing the value of the NM tag. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MAX_OPEN_TEMP_FILES (Integer)Relevant for a coordinate-sorted file containing read pairs only. Maximum number of file handles to keep open when spilling mate info to disk. Set this number a little lower than the per-process maximum number of files that may be open. This number can be found by executing the 'ulimit -n' command on a Unix system. Default value: 8000. This option can be set to 'null' to clear the default value.
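
The two-step workflow described above (summarize first, then focus on the more severe problems) might be scripted as in the sketch below. The helper function validate, the file names, the jar location in PICARD_JAR, the MAX_OUTPUT value of 500, and the choice of INVALID_TAG_NM as the type to ignore are all illustrative assumptions; pick the IGNORE values that suit your data.

```shell
# Assumption: picard.jar is in the current directory unless PICARD_JAR is set.
PICARD_JAR="${PICARD_JAR:-picard.jar}"
BAM=input.bam                  # hypothetical input file

# Hypothetical helper: print the ValidateSamFile command for the given
# options, then run it only if the jar and the input file exist.
validate() {
    set -- java -Xmx2g -jar "$PICARD_JAR" ValidateSamFile INPUT="$BAM" "$@"
    echo "$*"
    if [ -f "$PICARD_JAR" ] && [ -f "$BAM" ]; then
        "$@"
    fi
}

# Pass 1: a summary table of all error and warning counts.
validate MODE=SUMMARY

# Pass 2: detailed records for errors only, skipping one chosen error type
# and raising the output limit above the default of 100.
validate MODE=VERBOSE \
    IGNORE_WARNINGS=true \
    IGNORE=INVALID_TAG_NM \
    MAX_OUTPUT=500 \
    OUTPUT=validation_errors.txt
```

IGNORE may be specified more than once to skip several types in the same run.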

ViewSam

Prints a SAM or BAM file to the screen.

OptionDescription
INPUT (String)The SAM or BAM file or GA4GH url to view. Required.
ALIGNMENT_STATUS (AlignmentStatus)Print out all reads, just the aligned reads or just the unaligned reads. Default value: All. This option can be set to 'null' to clear the default value. Possible values: {Aligned, Unaligned, All}
PF_STATUS (PfStatus)Print out all reads, just the PF reads or just the non-PF reads. Default value: All. This option can be set to 'null' to clear the default value. Possible values: {PF, NonPF, All}
HEADER_ONLY (Boolean)Print the SAM header only. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
RECORDS_ONLY (Boolean)Print the alignment records only. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
INTERVAL_LIST (File)An intervals file used to restrict what records are output. Default value: null.
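
Usage example (a minimal sketch that prints only the header of a file; input.bam is a hypothetical name and the jar location in PICARD_JAR is an assumption):

```shell
# Assumption: picard.jar is in the current directory unless PICARD_JAR is set.
PICARD_JAR="${PICARD_JAR:-picard.jar}"

set -- java -Xmx2g -jar "$PICARD_JAR" ViewSam \
    INPUT=input.bam \
    HEADER_ONLY=true

echo "$*"                      # show the command that would be run
if [ -f "$PICARD_JAR" ] && [ -f input.bam ]; then
    "$@"
fi
```

Replacing HEADER_ONLY=true with ALIGNMENT_STATUS=Unaligned would instead print only the unaligned reads.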

VcfToIntervalList

Converts a VCF or BCF file to a Picard Interval List.

OptionDescription
INPUT (File)The BCF or VCF input file. The file format is determined by file extension. Required.
OUTPUT (File)The output Picard Interval List Required.
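
Usage example (a minimal sketch: input.vcf and variants.interval_list are hypothetical names, and the jar location in PICARD_JAR is an assumption):

```shell
# Assumption: picard.jar is in the current directory unless PICARD_JAR is set.
PICARD_JAR="${PICARD_JAR:-picard.jar}"

set -- java -Xmx2g -jar "$PICARD_JAR" VcfToIntervalList \
    INPUT=input.vcf \
    OUTPUT=variants.interval_list

echo "$*"                      # show the command that would be run
if [ -f "$PICARD_JAR" ] && [ -f input.vcf ]; then
    "$@"
fi
```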