Picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.

View the Project on GitHub broadinstitute/picard

Picard Metrics Definitions

Click on a metric to see a description of its fields.

  1. AlignmentSummaryMetrics: High level metrics about the alignment of reads within a SAM file, produced by the CollectAlignmentSummaryMetrics program and usually stored in a file with the extension ".alignment_summary_metrics".
  2. BaseDistributionByCycleMetrics:
  3. CollectHiSeqXPfFailMetrics.PFFailDetailedMetric: A metric class for describing PF-failing reads from an Illumina HiSeqX lane.
  4. CollectHiSeqXPfFailMetrics.PFFailSummaryMetric: Metrics produced by the GetHiSeqXPFFailMetrics program.
  5. CollectOxoGMetrics.CpcgMetrics: Metrics class for outputs.
  6. CollectQualityYieldMetrics.QualityYieldMetrics: A set of metrics used to describe the general quality of a BAM file
  7. CollectRawWgsMetrics.RawWgsMetrics:
  8. CollectVariantCallingMetrics.VariantCallingDetailMetrics: A collection of metrics relating to SNPs and indels within a variant-calling file (VCF) for a given sample.
  9. CollectVariantCallingMetrics.VariantCallingSummaryMetrics: A collection of metrics relating to SNPs and indels within a variant-calling file (VCF).
  10. CollectWgsMetrics.WgsMetrics: Metrics for evaluating the performance of whole genome sequencing experiments.
  11. CollectWgsMetricsWithNonZeroCoverage.WgsMetricsWithNonZeroCoverage: Metrics for evaluating the performance of whole genome sequencing experiments.
  12. DuplicationMetrics: Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords.
  13. ErrorSummaryMetrics: Summary metrics produced by CollectSequencingArtifactMetrics as a roll up of the context-specific error rates, to provide global error rates per type of base substitution.
  14. ExtractIlluminaBarcodes.BarcodeMetric: Metrics produced by the ExtractIlluminaBarcodes program that is used to parse data in the basecalls directory and determine to which barcode each read should be assigned.
  15. FingerprintingDetailMetrics: Detailed metrics about an individual SNP/Haplotype comparison within a fingerprint comparison.
  16. FingerprintingSummaryMetrics: Summary fingerprinting metrics and statistics about the comparison of the sequence data from a single read group (lane or index within a lane) vs. a set of known genotypes for the expected sample.
  17. GcBiasDetailMetrics: Class that holds detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.
  18. GcBiasMetrics:
  19. GcBiasSummaryMetrics: High level metrics that capture how biased the coverage in a certain lane is.
  20. GenotypeConcordanceContingencyMetrics: Class that holds metrics about the Genotype Concordance contingency tables.
  21. GenotypeConcordanceDetailMetrics: Class that holds detail metrics about Genotype Concordance
  22. GenotypeConcordanceSummaryMetrics: Class that holds summary metrics about Genotype Concordance
  23. HsMetrics: Metrics generated by CollectHsMetrics for the analysis of target-capture sequencing experiments.
  24. IlluminaBasecallingMetrics: Metric for Illumina Basecalling that stores means and standard deviations on a per-barcode per-lane basis.
  25. IlluminaLaneMetrics: Embodies characteristics that describe a lane.
  26. IlluminaPhasingMetrics: Metrics for Illumina Basecalling that stores median phasing and prephasing percentages on a per-template-read, per-lane basis.
  27. IndependentReplicateMetric: A class to store information relevant for biological rate estimation
  28. InsertSizeMetrics: Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insert_size_metrics".
  29. JumpingLibraryMetrics: High level metrics about the presence of outward- and inward-facing pairs within a SAM file generated with a jumping library, produced by the CollectJumpingLibraryMetrics program and usually stored in a file with the extension ".jump_metrics".
  30. MendelianViolationMetrics: Describes the type and number of mendelian violations found within a Trio.
  31. MergeableMetricBase: An extension of MetricBase that knows how to merge-by-adding fields that are appropriately annotated.
  32. MultilevelMetrics:
  33. RnaSeqMetrics: Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics program and usually stored in a file with the extension ".rna_metrics".
  34. RrbsCpgDetailMetrics: Holds information about CpG sites encountered for RRBS processing QC
  35. RrbsSummaryMetrics: Holds summary statistics from RRBS processing QC
  36. SequencingArtifactMetrics.BaitBiasDetailMetrics: Bait bias artifacts broken down by context.
  37. SequencingArtifactMetrics.BaitBiasSummaryMetrics: Summary analysis of a single bait bias artifact, also known as a reference bias artifact.
  38. SequencingArtifactMetrics.PreAdapterDetailMetrics: Pre-adapter artifacts broken down by context.
  39. SequencingArtifactMetrics.PreAdapterSummaryMetrics: Summary analysis of a single pre-adapter artifact.
  40. TargetedPcrMetrics: Metrics class for the analysis of reads obtained from targeted PCR experiments.
  41. UmiMetrics: Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords using the UmiAwareDuplicateSetIterator.

Note: Metrics labeled as percentages (with 'percent' in the full metric name or 'PCT' in the name given in the output file) are actually expressed as fractions. For example, 'PCT_TARGET_BASES_20X = 0.85' should be interpreted as '85 percent of targeted bases are covered to 20X coverage or more'.
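The note above can be illustrated with a minimal Python sketch that converts fraction-valued fields to percentages for display. The helper name `pct_fields_as_percent` and the sample row are hypothetical; the only assumption taken from the note is that fields whose names contain 'PCT' or 'PERCENT' hold fractions.

```python
def pct_fields_as_percent(row):
    """Return the PCT_/PERCENT_ fields of one metrics row as percentages.

    `row` maps field names to values as read from a metrics file
    (hypothetical input; values may be strings or numbers).
    """
    return {
        name: float(value) * 100.0
        for name, value in row.items()
        if "PCT" in name or "PERCENT" in name
    }


row = {"BAIT_SET": "example_baits", "PCT_TARGET_BASES_20X": "0.85"}
for name, percent in pct_fields_as_percent(row).items():
    print(f"{name}: {percent:.1f} percent")
```

Non-percentage fields such as BAIT_SET are simply skipped, so the helper can be applied to a whole row.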

AlignmentSummaryMetrics

High level metrics about the alignment of reads within a SAM file, produced by the CollectAlignmentSummaryMetrics program and usually stored in a file with the extension ".alignment_summary_metrics".

Field | Description
CATEGORY | One of UNPAIRED (for a fragment run), FIRST_OF_PAIR when the metrics are for only the first read in a paired run, SECOND_OF_PAIR when the metrics are for only the second read in a paired run, or PAIR when the metrics are aggregated for both first and second reads in a pair.
TOTAL_READS | The total number of reads including all PF and non-PF reads. When CATEGORY equals PAIR this value will be 2x the number of clusters.
PF_READS | The number of PF reads where PF is defined as passing Illumina's filter.
PCT_PF_READS | The fraction of reads that are PF (PF_READS / TOTAL_READS).
PF_NOISE_READS | The number of PF reads that are marked as noise reads. A noise read is one which is composed entirely of A bases and/or N bases. These reads are marked as they are usually artifactual and are of no use in downstream analysis.
PF_READS_ALIGNED | The number of PF reads that were aligned to the reference sequence. This includes reads that aligned with low quality (i.e. their alignments are ambiguous).
PCT_PF_READS_ALIGNED | The fraction of PF reads that aligned to the reference sequence (PF_READS_ALIGNED / PF_READS).
PF_ALIGNED_BASES | The total number of aligned bases, in all mapped PF reads, that are aligned to the reference sequence.
PF_HQ_ALIGNED_READS | The number of PF reads that were aligned to the reference sequence with a mapping quality of Q20 or higher, signifying that the aligner estimates a 1/100 (or smaller) chance that the alignment is wrong.
PF_HQ_ALIGNED_BASES | The number of bases aligned to the reference sequence in reads that were mapped at high quality. Will usually approximate PF_HQ_ALIGNED_READS * READ_LENGTH but may differ when either mixed read lengths are present or many reads are aligned with gaps.
PF_HQ_ALIGNED_Q20_BASES | The subset of PF_HQ_ALIGNED_BASES where the base call quality was Q20 or higher.
PF_HQ_MEDIAN_MISMATCHES | The median number of mismatches versus the reference sequence in reads that were aligned to the reference at high quality (i.e. PF_HQ_ALIGNED_READS).
PF_MISMATCH_RATE | The rate of bases mismatching the reference for all bases aligned to the reference sequence.
PF_HQ_ERROR_RATE | The fraction of bases that mismatch the reference in PF HQ aligned reads.
PF_INDEL_RATE | The number of insertion and deletion events per 100 aligned bases. Uses the number of events as the numerator, not the number of inserted or deleted bases.
MEAN_READ_LENGTH | The mean read length of the set of reads examined. When looking at the data for a single lane with equal length reads this number is just the read length. When looking at data for merged lanes with differing read lengths this is the mean read length of all reads.
READS_ALIGNED_IN_PAIRS | The number of aligned reads whose mate pair was also aligned to the reference.
PCT_READS_ALIGNED_IN_PAIRS | The fraction of reads whose mate pair was also aligned to the reference (READS_ALIGNED_IN_PAIRS / PF_READS_ALIGNED).
PF_READS_IMPROPER_PAIRS | The number of (primary) aligned reads that are not "properly" aligned in pairs (as per SAM flag 0x2).
PCT_PF_READS_IMPROPER_PAIRS | The fraction of (primary) reads that are not "properly" aligned in pairs (as per SAM flag 0x2) (PF_READS_IMPROPER_PAIRS / PF_READS_ALIGNED).
BAD_CYCLES | The number of instrument cycles in which 80% or more of base calls were no-calls.
STRAND_BALANCE | The number of PF reads aligned to the positive strand of the genome divided by the number of PF reads aligned to the genome.
PCT_CHIMERAS | The fraction of reads that map outside of a maximum insert size (usually 100kb) or that have the two ends mapping to different chromosomes.
PCT_ADAPTER | The fraction of PF reads that are unaligned and match to a known adapter sequence right from the start of the read.
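The ratio fields in this table follow directly from the count fields. The Python sketch below recomputes them from the formulas quoted in the descriptions and shows the standard phred relationship behind the "Q20 means a 1/100 chance" statement. The function names are hypothetical, and returning 0.0 for a zero denominator is an assumption of this sketch, not documented Picard behavior.

```python
def ratio(numerator, denominator):
    """Divide, returning 0.0 for an empty denominator (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def alignment_ratios(total_reads, pf_reads, pf_reads_aligned,
                     reads_aligned_in_pairs, pf_reads_improper_pairs):
    """Derive the fraction-valued fields from the counts in the table."""
    return {
        "PCT_PF_READS": ratio(pf_reads, total_reads),
        "PCT_PF_READS_ALIGNED": ratio(pf_reads_aligned, pf_reads),
        "PCT_READS_ALIGNED_IN_PAIRS": ratio(reads_aligned_in_pairs, pf_reads_aligned),
        "PCT_PF_READS_IMPROPER_PAIRS": ratio(pf_reads_improper_pairs, pf_reads_aligned),
    }


def phred_to_error_probability(q):
    """Phred scale: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


print(alignment_ratios(1000, 900, 810, 729, 81))
print(phred_to_error_probability(20))  # mapping quality Q20
```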

BaseDistributionByCycleMetrics

Field | Description
READ_END
CYCLE
PCT_A
PCT_C
PCT_G
PCT_T
PCT_N

CollectHiSeqXPfFailMetrics.PFFailDetailedMetric

A metric class for describing PF-failing reads from an Illumina HiSeqX lane.

Field | Description
TILE | The Tile that is described by this metric
X | The X coordinate of the read within the tile
Y | The Y coordinate of the read within the tile
NUM_N | The number of Ns found in this read
NUM_Q_GT_TWO | The number of Quality scores greater than 2 found in this read
CLASSIFICATION | The classification of this read: {EMPTY, POLYCLONAL, MISALIGNED, UNKNOWN} (See PFFailSummaryMetric for explanation regarding the possible classification.)

CollectHiSeqXPfFailMetrics.PFFailSummaryMetric

Metrics produced by the GetHiSeqXPFFailMetrics program. Used to diagnose lanes from HiSeqX sequencing by reporting the number and fraction of reads that failed PF for each possible reason. The possible reasons are EMPTY (reads from empty wells with no template strand), POLYCLONAL (reads from wells that had more than one strand cloned in them), MISALIGNED (reads from wells that are near the edge of the tile), and UNKNOWN (reads that did not pass PF but could not be diagnosed).

Field | Description
TILE | The Tile that is described by this metric. Can be a string (like "All") to mean some marginal over tiles.
READS | The total number of reads examined
PF_FAIL_READS | The number of non-PF reads in this tile.
PCT_PF_FAIL_READS | The fraction of reads in this tile that are non-PF (PF_FAIL_READS / READS).
PF_FAIL_EMPTY | The number of non-PF reads in this tile that are deemed empty.
PCT_PF_FAIL_EMPTY | The fraction of non-PF reads in this tile that are deemed empty (as fraction of all non-PF reads).
PF_FAIL_POLYCLONAL | The number of non-PF reads in this tile that are deemed polyclonal.
PCT_PF_FAIL_POLYCLONAL | The fraction of non-PF reads in this tile that are deemed polyclonal (as fraction of all non-PF reads).
PF_FAIL_MISALIGNED | The number of non-PF reads in this tile that are deemed "misaligned".
PCT_PF_FAIL_MISALIGNED | The fraction of non-PF reads in this tile that are deemed "misaligned" (as fraction of all non-PF reads).
PF_FAIL_UNKNOWN | The number of non-PF reads in this tile that have not been classified.
PCT_PF_FAIL_UNKNOWN | The fraction of non-PF reads in this tile that have not been classified (as fraction of all non-PF reads).

CollectOxoGMetrics.CpcgMetrics

Metrics class for outputs.

Field | Description
SAMPLE_ALIAS | The name of the sample being assayed.
LIBRARY | The name of the library being assayed.
CONTEXT | The sequence context being reported on.
TOTAL_SITES | The total number of sites that had at least one base covering them.
TOTAL_BASES | The total number of basecalls observed at all sites.
REF_NONOXO_BASES | The number of reference alleles observed as C in read 1 and G in read 2.
REF_OXO_BASES | The number of reference alleles observed as G in read 1 and C in read 2.
REF_TOTAL_BASES | The total number of reference alleles observed
ALT_NONOXO_BASES | The count of observed A basecalls at C reference positions and T basecalls at G reference bases that are correlated to instrument read number in a way that rules out oxidation as the cause
ALT_OXO_BASES | The count of observed A basecalls at C reference positions and T basecalls at G reference bases that are correlated to instrument read number in a way that is consistent with oxidative damage.
OXIDATION_ERROR_RATE | The oxo error rate, calculated as max(ALT_OXO_BASES - ALT_NONOXO_BASES, 1) / TOTAL_BASES
OXIDATION_Q | -10 * log10(OXIDATION_ERROR_RATE)
C_REF_REF_BASES | The number of ref basecalls observed at sites where the genome reference == C.
G_REF_REF_BASES | The number of ref basecalls observed at sites where the genome reference == G.
C_REF_ALT_BASES | The number of alt (A/T) basecalls observed at sites where the genome reference == C.
G_REF_ALT_BASES | The number of alt (A/T) basecalls observed at sites where the genome reference == G.
C_REF_OXO_ERROR_RATE | The rate at which C>A and G>T substitutions are observed at C reference sites above the expected rate if there were no bias between sites with a C reference base vs. a G reference base.
C_REF_OXO_Q | C_REF_OXO_ERROR_RATE expressed as a phred-scaled quality score.
G_REF_OXO_ERROR_RATE | The rate at which C>A and G>T substitutions are observed at G reference sites above the expected rate if there were no bias between sites with a C reference base vs. a G reference base.
G_REF_OXO_Q | G_REF_OXO_ERROR_RATE expressed as a phred-scaled quality score.
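The two formulas quoted in this table, for OXIDATION_ERROR_RATE and OXIDATION_Q, can be written out as a short Python sketch. The function names and the example counts are hypothetical; the arithmetic is taken directly from the field descriptions.

```python
import math


def oxidation_error_rate(alt_oxo_bases, alt_nonoxo_bases, total_bases):
    """max(ALT_OXO_BASES - ALT_NONOXO_BASES, 1) / TOTAL_BASES.

    The floor of 1 in the numerator keeps the rate strictly positive,
    so the phred-scaled score below is always finite.
    """
    return max(alt_oxo_bases - alt_nonoxo_bases, 1) / total_bases


def oxidation_q(error_rate):
    """-10 * log10(OXIDATION_ERROR_RATE)."""
    return -10 * math.log10(error_rate)


# Illustrative counts only.
rate = oxidation_error_rate(alt_oxo_bases=150, alt_nonoxo_bases=50, total_bases=1_000_000)
print(rate, oxidation_q(rate))
```

When the non-oxo count exceeds the oxo count, the numerator falls back to 1, so the reported Q is bounded by the number of bases observed.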

CollectQualityYieldMetrics.QualityYieldMetrics

A set of metrics used to describe the general quality of a BAM file

Field | Description
TOTAL_READS | The total number of reads in the input file
PF_READS | The number of reads that are PF (pass filter)
READ_LENGTH | The average read length of all the reads (will be fixed for a lane)
TOTAL_BASES | The total number of bases in all reads
PF_BASES | The total number of bases in all PF reads
Q20_BASES | The number of bases in all reads that achieve quality score 20 or higher
PF_Q20_BASES | The number of bases in PF reads that achieve quality score 20 or higher
Q30_BASES | The number of bases in all reads that achieve quality score 30 or higher
PF_Q30_BASES | The number of bases in PF reads that achieve quality score 30 or higher
Q20_EQUIVALENT_YIELD | The sum of quality scores of all bases divided by 20
PF_Q20_EQUIVALENT_YIELD | The sum of quality scores of all bases in PF reads divided by 20
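As an illustration of how these fields relate to one another, here is a minimal Python sketch that accumulates them from a list of reads. The input shape (a list of base qualities plus a PF flag per read) and the function name are hypothetical. The sketch also assumes that the PF_ variants count only bases in PF reads and that the yield is reported without rounding; the real tool may differ on both points.

```python
def quality_yield(reads):
    """Accumulate quality-yield style counts.

    `reads` is an iterable of (base_qualities, is_pf) tuples
    (hypothetical input format for this sketch).
    """
    m = {name: 0 for name in (
        "TOTAL_READS", "PF_READS", "TOTAL_BASES", "PF_BASES",
        "Q20_BASES", "PF_Q20_BASES", "Q30_BASES", "PF_Q30_BASES")}
    quality_sum = pf_quality_sum = 0
    for qualities, is_pf in reads:
        q20 = sum(q >= 20 for q in qualities)
        q30 = sum(q >= 30 for q in qualities)
        m["TOTAL_READS"] += 1
        m["TOTAL_BASES"] += len(qualities)
        m["Q20_BASES"] += q20
        m["Q30_BASES"] += q30
        quality_sum += sum(qualities)
        if is_pf:
            m["PF_READS"] += 1
            m["PF_BASES"] += len(qualities)
            m["PF_Q20_BASES"] += q20
            m["PF_Q30_BASES"] += q30
            pf_quality_sum += sum(qualities)
    m["READ_LENGTH"] = m["TOTAL_BASES"] / m["TOTAL_READS"] if m["TOTAL_READS"] else 0
    m["Q20_EQUIVALENT_YIELD"] = quality_sum / 20
    m["PF_Q20_EQUIVALENT_YIELD"] = pf_quality_sum / 20
    return m


print(quality_yield([([30, 30, 10], True), ([20, 2], False)]))
```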

CollectRawWgsMetrics.RawWgsMetrics

Field | Description

CollectVariantCallingMetrics.VariantCallingDetailMetrics

A collection of metrics relating to SNPs and indels within a variant-calling file (VCF) for a given sample.

Field | Description
SAMPLE_ALIAS | The name of the sample being assayed
HET_HOMVAR_RATIO | (count of hets)/(count of homozygous non-ref) for this sample
PCT_GQ0_VARIANTS | The fraction of variants in a particular sample that have a GQ score of 0.
TOTAL_GQ0_VARIANTS | The total number of variants in a particular sample that have a GQ score of 0.
TOTAL_HET_DEPTH | The total number of reads (from AD field) for passing bi-allelic SNP hets for this sample

CollectVariantCallingMetrics.VariantCallingSummaryMetrics

A collection of metrics relating to SNPs and indels within a variant-calling file (VCF).

Field | Description
TOTAL_SNPS | The number of passing bi-allelic SNP calls (i.e. non-reference genotypes) that were examined
NUM_IN_DB_SNP | The number of passing bi-allelic SNPs found in dbSNP
NOVEL_SNPS | The number of passing bi-allelic SNPs called that were not found in dbSNP
FILTERED_SNPS | The number of SNPs that are filtered
PCT_DBSNP | The fraction of passing bi-allelic SNPs in dbSNP
DBSNP_TITV | The Transition/Transversion ratio of the passing bi-allelic SNP calls made at dbSNP sites
NOVEL_TITV | The Transition/Transversion ratio of the passing bi-allelic SNP calls made at non-dbSNP sites
TOTAL_INDELS | The number of passing indel calls that were examined
NOVEL_INDELS | The number of passing indels called that were not found in dbSNP
FILTERED_INDELS | The number of indels that are filtered
PCT_DBSNP_INDELS | The fraction of passing indels in dbSNP
NUM_IN_DB_SNP_INDELS | The number of passing indels found in dbSNP
DBSNP_INS_DEL_RATIO | The Insertion/Deletion ratio of the indel calls made at dbSNP sites
NOVEL_INS_DEL_RATIO | The Insertion/Deletion ratio of the indel calls made at non-dbSNP sites
TOTAL_MULTIALLELIC_SNPS | The number of passing multi-allelic SNP calls that were examined
NUM_IN_DB_SNP_MULTIALLELIC | The number of passing multi-allelic SNPs found in dbSNP
TOTAL_COMPLEX_INDELS | The number of passing complex indel calls that were examined
NUM_IN_DB_SNP_COMPLEX_INDELS | The number of passing complex indels found in dbSNP
SNP_REFERENCE_BIAS | The rate at which reference bases are observed at ref/alt heterozygous SNP sites.
NUM_SINGLETONS | For summary metrics, the number of variants that appear in only one sample. For detail metrics, the number of variants that appear only in the current sample.

CollectWgsMetrics.WgsMetrics

Metrics for evaluating the performance of whole genome sequencing experiments.

Field | Description
GENOME_TERRITORY | The number of non-N bases in the genome reference over which coverage will be evaluated.
MEAN_COVERAGE | The mean coverage in bases of the genome territory, after all filters are applied.
SD_COVERAGE | The standard deviation of coverage of the genome after all filters are applied.
MEDIAN_COVERAGE | The median coverage in bases of the genome territory, after all filters are applied.
MAD_COVERAGE | The median absolute deviation of coverage of the genome after all filters are applied.
PCT_EXC_MAPQ | The fraction of aligned bases that were filtered out because they were in reads with low mapping quality (default is < 20).
PCT_EXC_DUPE | The fraction of aligned bases that were filtered out because they were in reads marked as duplicates.
PCT_EXC_UNPAIRED | The fraction of aligned bases that were filtered out because they were in reads without a mapped mate pair.
PCT_EXC_BASEQ | The fraction of aligned bases that were filtered out because they were of low base quality (default is < 20).
PCT_EXC_OVERLAP | The fraction of aligned bases that were filtered out because they were the second observation from an insert with overlapping reads.
PCT_EXC_CAPPED | The fraction of aligned bases that were filtered out because they would have raised coverage above the capped value (default cap = 250x).
PCT_EXC_TOTAL | The total fraction of aligned bases excluded due to all filters.
PCT_1X | The fraction of bases that attained at least 1X sequence coverage in post-filtering bases.
PCT_5X | The fraction of bases that attained at least 5X sequence coverage in post-filtering bases.
PCT_10X | The fraction of bases that attained at least 10X sequence coverage in post-filtering bases.
PCT_15X | The fraction of bases that attained at least 15X sequence coverage in post-filtering bases.
PCT_20X | The fraction of bases that attained at least 20X sequence coverage in post-filtering bases.
PCT_25X | The fraction of bases that attained at least 25X sequence coverage in post-filtering bases.
PCT_30X | The fraction of bases that attained at least 30X sequence coverage in post-filtering bases.
PCT_40X | The fraction of bases that attained at least 40X sequence coverage in post-filtering bases.
PCT_50X | The fraction of bases that attained at least 50X sequence coverage in post-filtering bases.
PCT_60X | The fraction of bases that attained at least 60X sequence coverage in post-filtering bases.
PCT_70X | The fraction of bases that attained at least 70X sequence coverage in post-filtering bases.
PCT_80X | The fraction of bases that attained at least 80X sequence coverage in post-filtering bases.
PCT_90X | The fraction of bases that attained at least 90X sequence coverage in post-filtering bases.
PCT_100X | The fraction of bases that attained at least 100X sequence coverage in post-filtering bases.
HET_SNP_SENSITIVITY | The theoretical HET SNP sensitivity.
HET_SNP_Q | The Phred Scaled Q Score of the theoretical HET SNP sensitivity.
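To make the coverage summary fields concrete, the following Python sketch computes a mean, median, median absolute deviation, and "at least nX" fractions from a list of per-base depths. The function name, the input shape, and the tiny example are hypothetical, and the sketch applies none of the filters (mapping quality, duplicates, capping, and so on) that the real metrics are computed after.

```python
from statistics import mean, median


def coverage_summary(depths, thresholds=(1, 5, 10, 15, 20, 25, 30)):
    """Summarize per-base depths over the genome territory.

    `depths` holds one post-filtering depth per evaluated base
    (hypothetical input; a real genome would use a histogram instead).
    """
    med = median(depths)
    summary = {
        "MEAN_COVERAGE": mean(depths),
        "MEDIAN_COVERAGE": med,
        # Median absolute deviation: median of |depth - median|.
        "MAD_COVERAGE": median(abs(d - med) for d in depths),
    }
    for t in thresholds:
        # Fraction of bases covered at depth t or more.
        summary[f"PCT_{t}X"] = sum(d >= t for d in depths) / len(depths)
    return summary


print(coverage_summary([0, 4, 10, 12, 30, 30, 31, 45]))
```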

CollectWgsMetricsWithNonZeroCoverage.WgsMetricsWithNonZeroCoverage

Metrics for evaluating the performance of whole genome sequencing experiments.

Field | Description
CATEGORY | One of either WHOLE_GENOME or NON_ZERO_REGIONS

DuplicationMetrics

Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords.

Field | Description
LIBRARY | The library on which the duplicate marking was performed.
UNPAIRED_READS_EXAMINED | The number of mapped reads examined which did not have a mapped mate pair, either because the read is unpaired, or the read is paired to an unmapped mate.
READ_PAIRS_EXAMINED | The number of mapped read pairs examined. (Primary, non-supplementary)
SECONDARY_OR_SUPPLEMENTARY_RDS | The number of reads that were either secondary or supplementary
UNMAPPED_READS | The total number of unmapped reads examined. (Primary, non-supplementary)
UNPAIRED_READ_DUPLICATES | The number of fragments that were marked as duplicates.
READ_PAIR_DUPLICATES | The number of read pairs that were marked as duplicates.
READ_PAIR_OPTICAL_DUPLICATES | The number of read pair duplicates that were caused by optical duplication. Value is always < READ_PAIR_DUPLICATES, which counts all duplicates regardless of source.
PERCENT_DUPLICATION | The fraction of mapped sequence that is marked as duplicate.
ESTIMATED_LIBRARY_SIZE | The estimated number of unique molecules in the library based on PE duplication.
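One plausible reading of PERCENT_DUPLICATION, sketched below in Python, counts each read pair as two reads so that fragments and pairs can be combined into a single fraction. This weighting is an assumption of the sketch and is not stated in the table; the function name and example counts are hypothetical.

```python
def percent_duplication(unpaired_reads_examined, read_pairs_examined,
                        unpaired_read_duplicates, read_pair_duplicates):
    """Fraction of mapped reads marked as duplicate.

    Assumption: a read pair contributes two reads to both the
    numerator and the denominator.
    """
    examined = unpaired_reads_examined + 2 * read_pairs_examined
    duplicates = unpaired_read_duplicates + 2 * read_pair_duplicates
    return duplicates / examined if examined else 0.0


# Illustrative counts: 100 fragments, 450 pairs, 10 and 45 duplicates.
print(percent_duplication(100, 450, 10, 45))
```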

ErrorSummaryMetrics

Summary metrics produced by CollectSequencingArtifactMetrics as a roll up of the context-specific error rates, to provide global error rates per type of base substitution. Errors are normalized to the lexically lower reference base and summarized together. E.g. G>T is converted to C>A and merged with data from C>A for reporting.

Field | Description
REF_BASE | The reference base (or its complement).
ALT_BASE | The alternative base (or its complement).
SUBSTITUTION | A single string representing the substitution from REF_BASE to ALT_BASE for convenience.
REF_COUNT | The number of reference bases observed.
ALT_COUNT | The number of alt bases observed.
SUBSTITUTION_RATE | The rate of the substitution in question.
ExtractIlluminaBarcodes.BarcodeMetric

Metrics produced by the ExtractIlluminaBarcodes program that is used to parse data in the basecalls directory and determine to which barcode each read should be assigned.

Field | Description
BARCODE | The barcode (from the set of expected barcodes) for which the following metrics apply. Note that the "symbolic" barcode of NNNNNN is used to report metrics for all reads that do not match a barcode.
BARCODE_NAME | The barcode name.
LIBRARY_NAME | The name of the library
READS | The total number of reads matching the barcode.
PF_READS | The number of PF reads matching this barcode (always less than or equal to READS).
PERFECT_MATCHES | The number of all reads matching this barcode that matched with 0 errors or no-calls.
PF_PERFECT_MATCHES | The number of PF reads matching this barcode that matched with 0 errors or no-calls.
ONE_MISMATCH_MATCHES | The number of all reads matching this barcode that matched with 1 error or no-call.
PF_ONE_MISMATCH_MATCHES | The number of PF reads matching this barcode that matched with 1 error or no-call.
PCT_MATCHES | The fraction of all reads in the lane that matched to this barcode.
RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT | The ratio of all reads matching this barcode to all reads matching the most prevalent barcode. For the most prevalent barcode this will be 1; for all others it will be less than 1 (except when there are more orphan reads than reads for any other barcode, in which case the value may be arbitrarily large). One over the lowest number in this column gives you the fold-difference in representation between barcodes.
PF_PCT_MATCHES | The fraction of PF reads in the lane that matched to this barcode.
PF_RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT | The ratio of PF reads matching this barcode to PF reads matching the most prevalent barcode. For the most prevalent barcode this will be 1; for all others it will be less than 1 (except when there are more orphan reads than reads for any other barcode, in which case the value may be arbitrarily large). One over the lowest number in this column gives you the fold-difference in representation of PF reads between barcodes.
PF_NORMALIZED_MATCHES | The "normalized" matches to each barcode. This is calculated as the number of PF reads matching this barcode over the mean number of PF reads matching any barcode (excluding orphans). If all barcodes are represented equally this will be 1.
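The ratio-to-best-barcode fields and the fold-difference remark can be illustrated with a small Python sketch. It assumes, based on the orphan-read exception in the description, that the "most prevalent barcode" is chosen among real barcodes only, so the NNNNNN row may exceed 1. The function name, barcode strings, and counts are hypothetical.

```python
ORPHAN = "NNNNNN"  # symbolic barcode for reads matching no expected barcode


def ratio_to_best_barcode(reads_by_barcode):
    """Ratio of each barcode's read count to the most prevalent real barcode."""
    best = max(n for barcode, n in reads_by_barcode.items() if barcode != ORPHAN)
    return {barcode: n / best for barcode, n in reads_by_barcode.items()}


counts = {"ACGTAC": 1000, "TTGACA": 500, ORPHAN: 100}
ratios = ratio_to_best_barcode(counts)
print(ratios)

# Fold-difference in representation between the real barcodes:
# one over the lowest ratio among them.
lowest = min(r for barcode, r in ratios.items() if barcode != ORPHAN)
print(1 / lowest)
```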

FingerprintingDetailMetrics

Detailed metrics about an individual SNP/Haplotype comparison within a fingerprint comparison.

Field | Description
READ_GROUP | The sequencing read group from which sequence data was fingerprinted.
SAMPLE | The name of the sample whose genotypes the sequence data was compared to.
SNP | The name of a representative SNP within the haplotype that was compared. Will usually be the exact SNP that was genotyped externally.
SNP_ALLELES | The possible alleles for the SNP.
CHROM | The chromosome on which the SNP resides.
POSITION | The position of the SNP on the chromosome.
EXPECTED_GENOTYPE | The expected genotype of the sample at the SNP locus.
OBSERVED_GENOTYPE | The most likely genotype given the observed evidence at the SNP locus in the sequencing data.
LOD | The LOD score for OBSERVED_GENOTYPE vs. the next most likely genotype in the sequencing data.
OBS_A | The number of observations of the first, or A, allele of the SNP in the sequencing data.
OBS_B | The number of observations of the second, or B, allele of the SNP in the sequencing data.

FingerprintingSummaryMetrics

Summary fingerprinting metrics and statistics about the comparison of the sequence data from a single read group (lane or index within a lane) vs. a set of known genotypes for the expected sample.

Field | Description
READ_GROUP | The read group from which sequence data was drawn for comparison.
SAMPLE | The sample whose known genotypes the sequence data was compared to.
LL_EXPECTED_SAMPLE | The Log Likelihood of the sequence data given the expected sample's genotypes.
LL_RANDOM_SAMPLE | The Log Likelihood of the sequence data given a random sample from the human population.
LOD_EXPECTED_SAMPLE | The LOD for Expected Sample vs. Random Sample. A positive LOD indicates that the sequence data is more likely to come from the expected sample vs. a random sample from the population, by LOD logs. I.e. a value of 6 indicates that the sequence data is 1,000,000 times more likely to come from the expected sample than from a random sample. A negative LOD indicates the reverse: the sequence data is more likely to come from a random sample than from the expected sample.
HAPLOTYPES_WITH_GENOTYPES | The number of haplotypes that had expected genotypes to compare to.
HAPLOTYPES_CONFIDENTLY_CHECKED | The subset of genotyped haplotypes for which there was sufficient sequence data to confidently genotype the haplotype. Note: all haplotypes with sequence coverage contribute to the LOD score, even if they cannot be "confidently checked" individually.
HAPLOTYPES_CONFIDENTLY_MATCHING | The subset of confidently checked haplotypes that match the expected genotypes.
HET_AS_HOM | The number of hets observed as homs with LOD > threshold
HOM_AS_HET | The number of homs observed as hets with LOD > threshold
HOM_AS_OTHER_HOM | The number of homs observed as other homs with LOD > threshold

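The relationship between the two log likelihoods and LOD_EXPECTED_SAMPLE can be sketched as follows. The sketch assumes the log likelihoods are base-10 (which is what "a value of 6 indicates 1,000,000 times more likely" implies) and that the LOD is their difference; the function names and example values are hypothetical.

```python
def lod_expected_sample(ll_expected_sample, ll_random_sample):
    """LOD as the difference of two base-10 log likelihoods (assumed)."""
    return ll_expected_sample - ll_random_sample


def likelihood_ratio(lod):
    """How many times more likely the data is under the expected sample."""
    return 10 ** lod


lod = lod_expected_sample(-120.0, -126.0)
print(lod, likelihood_ratio(lod))
```

A negative LOD gives a ratio below 1, meaning a random sample explains the data better than the expected one.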
GcBiasDetailMetrics

Class that holds detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.

Field | Description
ACCUMULATION_LEVEL |
READS_USED | This option is used to mark including or excluding duplicates.
GC | The G+C content of the reference sequence represented by this bin. Values are from 0% to 100%
WINDOWS | The number of windows on the reference genome that have this G+C content.
READ_STARTS | The number of reads whose start position is at the start of a window of this GC.
MEAN_BASE_QUALITY | The mean quality (determined via the error rate) of all bases of all reads that are assigned to windows of this GC.
NORMALIZED_COVERAGE | The ratio of "coverage" in this GC bin vs. the mean coverage of all GC bins. A number of 1 represents mean coverage, a number less than one represents lower than mean coverage (e.g. 0.5 means half as much coverage as average) while a number greater than one represents higher than mean coverage (e.g. 3.1 means this GC bin has 3.1 times more reads per window than average).
ERROR_BAR_WIDTH | The radius of error bars in this bin based on the number of observations made. For example if the normalized coverage is 0.75 and the error bar width is 0.1 then the error bars would be drawn from 0.65 to 0.85.
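A minimal Python sketch of NORMALIZED_COVERAGE and the error-bar range follows. It interprets "coverage" as reads per window, as the 3.1x example in the description suggests, and takes the mean over all windows in all bins; both the interpretation and the function names are assumptions of this sketch.

```python
def normalized_coverage(read_starts, windows):
    """Reads per window in each GC bin, relative to reads per window overall.

    `read_starts` and `windows` are parallel lists, one entry per GC bin.
    Bins with no windows are reported as 0.0 (assumed convention).
    """
    mean_reads_per_window = sum(read_starts) / sum(windows)
    return [
        (reads / n_windows) / mean_reads_per_window if n_windows else 0.0
        for reads, n_windows in zip(read_starts, windows)
    ]


def error_bar_range(coverage, error_bar_width):
    """ERROR_BAR_WIDTH is a radius: bars span coverage +/- width."""
    return coverage - error_bar_width, coverage + error_bar_width


print(normalized_coverage([10, 60, 30], [10, 30, 10]))
print(error_bar_range(0.75, 0.1))  # roughly 0.65 to 0.85
```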

GcBiasMetrics

Field | Description
DETAILS
SUMMARY

GcBiasSummaryMetrics

High level metrics that capture how biased the coverage in a certain lane is.

Field | Description
ACCUMULATION_LEVEL |
READS_USED | This option is used to mark including or excluding duplicates.
WINDOW_SIZE | The window size on the genome used to calculate the GC of the sequence.
TOTAL_CLUSTERS | The total number of clusters that were seen in the GC bias calculation.
ALIGNED_READS | The total number of aligned reads used to compute the GC bias metrics.
AT_DROPOUT | Illumina-style AT dropout metric. Calculated by taking each GC bin independently and calculating (%ref_at_gc - %reads_at_gc) and summing all positive values for GC=[0..50].
GC_DROPOUT | Illumina-style GC dropout metric. Calculated by taking each GC bin independently and calculating (%ref_at_gc - %reads_at_gc) and summing all positive values for GC=[50..100].
GC_NC_0_19 | Normalized coverage over the quintile of GC content ranging from 0 - 19.
GC_NC_20_39 | Normalized coverage over the quintile of GC content ranging from 20 - 39.
GC_NC_40_59 | Normalized coverage over the quintile of GC content ranging from 40 - 59.
GC_NC_60_79 | Normalized coverage over the quintile of GC content ranging from 60 - 79.
GC_NC_80_100 | Normalized coverage over the quintile of GC content ranging from 80 - 100.
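The AT_DROPOUT and GC_DROPOUT descriptions translate into a single summation over GC bins, sketched below in Python. The inputs are hypothetical dictionaries mapping a GC percentage (0 to 100) to the percentage of reference windows and of reads in that bin; missing bins are treated as zero, which is an assumption of this sketch.

```python
def dropout(pct_ref_at_gc, pct_reads_at_gc, gc_low, gc_high):
    """Sum the positive values of (%ref_at_gc - %reads_at_gc) over a GC range."""
    total = 0.0
    for gc in range(gc_low, gc_high + 1):
        shortfall = pct_ref_at_gc.get(gc, 0.0) - pct_reads_at_gc.get(gc, 0.0)
        if shortfall > 0:  # only under-represented bins contribute
            total += shortfall
    return total


# Illustrative distributions; each sums to 100 percent.
ref = {30: 10.0, 50: 60.0, 70: 30.0}
reads = {30: 6.0, 50: 66.0, 70: 28.0}
print("AT_DROPOUT", dropout(ref, reads, 0, 50))
print("GC_DROPOUT", dropout(ref, reads, 50, 100))
```

Note that the GC=50 bin falls in both ranges as the descriptions are written, so a shortfall there would count toward both metrics.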

GenotypeConcordanceContingencyMetrics

Class that holds metrics about the Genotype Concordance contingency tables.

Field | Description
VARIANT_TYPE | The type of the event (i.e. either SNP or INDEL)
TRUTH_SAMPLE | The name of the 'truth' sample
CALL_SAMPLE | The name of the 'call' sample
TP_COUNT | The TP (true positive) count across all variants
TN_COUNT | The TN (true negative) count across all variants
FP_COUNT | The FP (false positive) count across all variants
FN_COUNT | The FN (false negative) count across all variants
EMPTY_COUNT | The empty (no contingency info) count across all variants

GenotypeConcordanceDetailMetrics

Class that holds detail metrics about Genotype Concordance

Field | Description
VARIANT_TYPE | The type of the event (i.e. either SNP or INDEL)
TRUTH_SAMPLE | The name of the 'truth' sample
CALL_SAMPLE | The name of the 'call' sample
TRUTH_STATE | The state of the 'truth' sample (e.g. HOM_REF, HET_REF_VAR1, HET_VAR1_VAR2...)
CALL_STATE | The state of the 'call' sample (e.g. HOM_REF, HET_REF_VAR1...)
COUNT | The number of events of type TRUTH_STATE and CALL_STATE for the VARIANT_TYPE and SAMPLEs
CONTINGENCY_VALUES | The list of contingency table values (TP, TN, FP, FN) that are deduced from the truth/call state comparison, given the reference. In general, we are comparing two sets of alleles. Therefore, we can have zero or more contingency table values represented in one comparison. For example, if the truthset is a heterozygous call with both alleles non-reference (HET_VAR1_VAR2), and the callset is a heterozygous call with both alleles non-reference with one of the alternate alleles matching an alternate allele in the truthset, we would have a true positive, false positive, and false negative. The true positive is from the matching alternate alleles, the false positive is the alternate allele found in the callset but not found in the truthset, and the false negative is the alternate in the truthset not found in the callset. We also include a true negative in cases where the reference allele is found in both the truthset and callset.

GenotypeConcordanceSummaryMetrics

Class that holds summary metrics about Genotype Concordance

FieldDescription
VARIANT_TYPEThe type of the event (i.e. either SNP or INDEL)
TRUTH_SAMPLEThe name of the 'truth' sample
CALL_SAMPLEThe name of the 'call' sample
HET_SENSITIVITYThe sensitivity for all heterozygous variants (Sensitivity is TP / (TP + FN))
HET_PPVThe ppv (positive predictive value) for all heterozygous variants (PPV is the TP / (TP + FP))
HET_SPECIFICITYThe specificity for all heterozygous variants cannot be calculated.
HOMVAR_SENSITIVITYThe sensitivity for all homozygous variants (Sensitivity is TP / (TP + FN))
HOMVAR_PPVThe ppv (positive predictive value) for all homozygous variants (PPV is the TP / (TP + FP))
HOMVAR_SPECIFICITYThe specificity for all homozygous variants cannot be calculated.
VAR_SENSITIVITYThe sensitivity for all (heterozygous and homozygous) variants (Sensitivity is TP / (TP + FN))
VAR_PPVThe ppv (positive predictive value) for all (heterozygous and homozygous) variants (PPV is the TP / (TP + FP))
VAR_SPECIFICITYThe specificity for all (heterozygous and homozygous) variants (Specificity is TN / (FP + TN))
GENOTYPE_CONCORDANCEThe genotype concordance for all possible states. Genotype Concordance is the number of times the truth and call states match exactly / all truth and call combinations made
NON_REF_GENOTYPE_CONCORDANCEThe non-ref genotype concordance, i.e. for all var states only. Non Ref Genotype Concordance is the number of times the truth and call states match exactly for *vars only* / all truth and call *var* combinations made
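
As a rough illustration of the ratios quoted in this table, the sketch below computes sensitivity, PPV, specificity and overall genotype concordance from raw counts. The function names, the `counts` dictionary layout and the choice to return NaN for an empty denominator are assumptions made for the example, not part of Picard.

```python
import math


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def sensitivity(tp, fn):
    return safe_ratio(tp, tp + fn)   # TP / (TP + FN)


def ppv(tp, fp):
    return safe_ratio(tp, tp + fp)   # TP / (TP + FP)


def specificity(tn, fp):
    return safe_ratio(tn, fp + tn)   # TN / (FP + TN)


def genotype_concordance(counts):
    """counts maps (truth_state, call_state) -> number of events."""
    total = sum(counts.values())
    matches = sum(n for (truth, call), n in counts.items() if truth == call)
    return safe_ratio(matches, total)


counts = {
    ("HOM_REF", "HOM_REF"): 80,
    ("HET_REF_VAR1", "HET_REF_VAR1"): 15,
    ("HET_REF_VAR1", "HOM_REF"): 5,
}
concordance = genotype_concordance(counts)
```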

HsMetrics

Metrics generated by CollectHsMetrics for the analysis of target-capture sequencing experiments. The metrics in this class fall broadly into three categories:

FieldDescription
BAIT_SETThe name of the bait set used in the hybrid selection.
GENOME_SIZEThe number of bases in the reference genome used for alignment.
BAIT_TERRITORYThe number of bases which are localized to one or more baits.
TARGET_TERRITORYThe unique number of target bases in the experiment, where the target sequence is usually exons etc.
BAIT_DESIGN_EFFICIENCYThe ratio of TARGET_TERRITORY/BAIT_TERRITORY. A value of 1 indicates a perfect design efficiency, while a value of 0.5 indicates that half of bases within the bait region are not within the target region.
TOTAL_READSThe total number of reads in the SAM or BAM file examined.
PF_READSThe total number of reads that pass the vendor's filter.
PF_UNIQUE_READSThe number of PF reads that are not marked as duplicates.
PCT_PF_READSThe fraction of reads passing the vendor's filter, PF_READS/TOTAL_READS.
PCT_PF_UQ_READSThe fraction of PF_UNIQUE_READS from the TOTAL_READS, PF_UNIQUE_READS/TOTAL_READS.
PF_UQ_READS_ALIGNEDThe number of PF_UNIQUE_READS that aligned to the reference genome with a mapping score > 0.
PCT_PF_UQ_READS_ALIGNEDThe fraction of PF_UQ_READS_ALIGNED from the total number of PF reads.
PF_BASES_ALIGNEDThe number of PF unique bases that are aligned to the reference genome with mapping scores > 0.
PF_UQ_BASES_ALIGNEDThe number of bases in the PF_UQ_READS_ALIGNED reads. Accounts for clipping and gaps.
ON_BAIT_BASESThe number of PF_BASES_ALIGNED that are mapped to the baited regions of the genome.
NEAR_BAIT_BASESThe number of PF_BASES_ALIGNED that are mapped to within a fixed interval containing a baited region, but not within the baited section per se.
OFF_BAIT_BASESThe number of PF_BASES_ALIGNED that are mapped away from any baited region.
ON_TARGET_BASESThe number of PF_BASES_ALIGNED that are mapped to a targeted region of the genome.
PCT_SELECTED_BASESThe fraction of PF_BASES_ALIGNED located on or near a baited region (ON_BAIT_BASES + NEAR_BAIT_BASES)/PF_BASES_ALIGNED.
PCT_OFF_BAITThe fraction of PF_BASES_ALIGNED that are mapped away from any baited region, OFF_BAIT_BASES/PF_BASES_ALIGNED.
ON_BAIT_VS_SELECTEDThe fraction of bases on or near baits that are covered by baits, ON_BAIT_BASES/(ON_BAIT_BASES + NEAR_BAIT_BASES).
MEAN_BAIT_COVERAGEThe mean coverage of all baits in the experiment.
MEAN_TARGET_COVERAGEThe mean coverage of a target region.
MEDIAN_TARGET_COVERAGEThe median coverage of a target region.
MAX_TARGET_COVERAGEThe maximum coverage of reads that mapped to target regions of an experiment.
PCT_USABLE_BASES_ON_BAITThe fraction of aligned, de-duped, on-bait bases out of the PF bases available.
PCT_USABLE_BASES_ON_TARGETThe fraction of aligned, de-duped, on-target bases out of all of the PF bases available.
FOLD_ENRICHMENTThe fold by which the baited region has been amplified above genomic background.
ZERO_CVG_TARGETS_PCTThe fraction of targets that did not reach coverage=1 over any base.
PCT_EXC_DUPEThe fraction of aligned bases that were filtered out because they were in reads marked as duplicates.
PCT_EXC_MAPQThe fraction of aligned bases that were filtered out because they were in reads with low mapping quality.
PCT_EXC_BASEQThe fraction of aligned bases that were filtered out because they were of low base quality.
PCT_EXC_OVERLAPThe fraction of aligned bases that were filtered out because they were the second observation from an insert with overlapping reads.
PCT_EXC_OFF_TARGETThe fraction of aligned bases that were filtered out because they did not align over a target base.
FOLD_80_BASE_PENALTYThe fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets.
PCT_TARGET_BASES_1XThe fraction of all target bases achieving 1X or greater coverage.
PCT_TARGET_BASES_2XThe fraction of all target bases achieving 2X or greater coverage.
PCT_TARGET_BASES_10XThe fraction of all target bases achieving 10X or greater coverage.
PCT_TARGET_BASES_20XThe fraction of all target bases achieving 20X or greater coverage.
PCT_TARGET_BASES_30XThe fraction of all target bases achieving 30X or greater coverage.
PCT_TARGET_BASES_40XThe fraction of all target bases achieving 40X or greater coverage.
PCT_TARGET_BASES_50XThe fraction of all target bases achieving 50X or greater coverage.
PCT_TARGET_BASES_100XThe fraction of all target bases achieving 100X or greater coverage.
HS_LIBRARY_SIZEThe estimated number of unique molecules in the selected part of the library.
HS_PENALTY_10XThe "hybrid selection penalty" incurred to get 80% of target bases to 10X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 10X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 10 * HS_PENALTY_10X.
HS_PENALTY_20XThe "hybrid selection penalty" incurred to get 80% of target bases to 20X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 20X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 20 * HS_PENALTY_20X.
HS_PENALTY_30XThe "hybrid selection penalty" incurred to get 80% of target bases to 30X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 30X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 30 * HS_PENALTY_30X.
HS_PENALTY_40XThe "hybrid selection penalty" incurred to get 80% of target bases to 40X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 40X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 40 * HS_PENALTY_40X.
HS_PENALTY_50XThe "hybrid selection penalty" incurred to get 80% of target bases to 50X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 50X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 50 * HS_PENALTY_50X.
HS_PENALTY_100XThe "hybrid selection penalty" incurred to get 80% of target bases to 100X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 100X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 100 * HS_PENALTY_100X.
AT_DROPOUTA measure of how undercovered <= 50% GC regions are relative to the mean. For each GC bin [0..50] we calculate a = % of target territory, and b = % of aligned reads aligned to these targets. AT DROPOUT is then abs(sum(a-b when a-b < 0)). E.g. if the value is 5% this implies that 5% of total reads that should have mapped to GC<=50% regions mapped elsewhere.
GC_DROPOUTA measure of how undercovered >= 50% GC regions are relative to the mean. For each GC bin [50..100] we calculate a = % of target territory, and b = % of aligned reads aligned to these targets. GC DROPOUT is then abs(sum(a-b when a-b < 0)). E.g. if the value is 5% this implies that 5% of total reads that should have mapped to GC>=50% regions mapped elsewhere.
HET_SNP_SENSITIVITYThe theoretical HET SNP sensitivity.
HET_SNP_QThe Phred Scaled Q Score of the theoretical HET SNP sensitivity.
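
The ratio fields and the HS_PENALTY interpretation in this table reduce to simple arithmetic. The following sketch restates them under the formulas given above; the function names and the lower-case dictionary keys are illustrative assumptions, not Picard identifiers.

```python
import math


def bait_ratios(on_bait, near_bait, off_bait, target_territory, bait_territory):
    """Derive the bait-related ratios from base counts, as defined in the table."""
    aligned = on_bait + near_bait + off_bait  # stands in for PF_BASES_ALIGNED
    selected = on_bait + near_bait
    return {
        "bait_design_efficiency": target_territory / bait_territory,
        "pct_selected_bases": selected / aligned,
        "pct_off_bait": off_bait / aligned,
        "on_bait_vs_selected": on_bait / selected,
    }


def required_aligned_bases(target_territory, coverage, hs_penalty):
    """Aligned bases needed to reach `coverage`, per the HS_PENALTY description."""
    return target_territory * coverage * hs_penalty


ratios = bait_ratios(on_bait=600, near_bait=200, off_bait=200,
                     target_territory=5000, bait_territory=10000)
# A 10 megabase design, 10X coverage, and a hypothetical penalty of 2.5.
needed = required_aligned_bases(10**7, 10, 2.5)
```

With a hypothetical HS_PENALTY_10X of 2.5, a 10 megabase design would need about 2.5 * 10^8 aligned bases to reach 10X.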

IlluminaBasecallingMetrics

Metric for Illumina Basecalling that stores means and standard deviations on a per-barcode per-lane basis. Averages and means are taken over all tiles.

FieldDescription
LANEThe lane for which the metrics were calculated.
MOLECULAR_BARCODE_SEQUENCE_1The barcode sequence for which the metrics were calculated.
MOLECULAR_BARCODE_NAMEThe barcode name for which the metrics were calculated.
TOTAL_BASESThe total number of bases assigned to the index.
PF_BASESThe total number of passing-filter bases assigned to the index.
TOTAL_READSThe total number of reads assigned to the index.
PF_READSThe total number of passing-filter reads assigned to the index.
TOTAL_CLUSTERSThe total number of clusters assigned to the index.
PF_CLUSTERSThe total number of PF clusters assigned to the index.
MEAN_CLUSTERS_PER_TILEThe mean number of clusters per tile.
SD_CLUSTERS_PER_TILEThe standard deviation of clusters per tile.
MEAN_PCT_PF_CLUSTERS_PER_TILEThe mean percentage of pf clusters per tile.
SD_PCT_PF_CLUSTERS_PER_TILEThe standard deviation in percentage of pf clusters per tile.
MEAN_PF_CLUSTERS_PER_TILEThe mean number of pf clusters per tile.
SD_PF_CLUSTERS_PER_TILEThe standard deviation in number of pf clusters per tile.

IlluminaLaneMetrics

Embodies characteristics that describe a lane.

FieldDescription
CLUSTER_DENSITYThe number of clusters per unit area on this lane, expressed in units of [cluster / mm^2].
LANEThis lane's number.

IlluminaPhasingMetrics

Metrics for Illumina Basecalling that stores median phasing and prephasing percentages on a per-template-read, per-lane basis. Phasing refers to the fraction of molecules that fall behind or jump ahead (prephasing) during a read cycle. For each lane/template read # (i.e. FIRST, SECOND) combination we will store the median values of both the phasing and prephasing values for every tile in that lane/template read pair.

FieldDescription
LANEIllumina flowcell lane number
TYPE_NAMEDefines an Illumina template read number (first or second)
PHASING_APPLIEDMedian phasing value across all tiles in a lane, applied to the first and second template reads
PREPHASING_APPLIEDMedian pre-phasing value across all tiles in a lane, applied to the first and second template reads

IndependentReplicateMetric

A class to store information relevant for biological rate estimation

FieldDescription
nSites
nThreeAllelesSites
nTotalReads
nDuplicateSets
nExactlyTriple
nExactlyDouble
nReadsInBigSets
nDifferentAllelesBiDups
nReferenceAllelesBiDups
nAlternateAllelesBiDups
nDifferentAllelesTriDups
nMismatchingAllelesBiDups
nReferenceAllelesTriDups
nAlternateAllelesTriDups
nMismatchingAllelesTriDups
nReferenceReads
nAlternateReads
nMismatchingUMIsInDiffBiDups
nMatchingUMIsInDiffBiDups
nMismatchingUMIsInSameBiDups
nMatchingUMIsInSameBiDups
nMismatchingUMIsInCoOrientedBiDups
nMismatchingUMIsInContraOrientedBiDups
nBadBarcodes
nGoodBarcodes
biSiteHeterogeneityRate
triSiteHeterogeneityRate
biSiteHomogeneityRate
triSiteHomogeneityRate
independentReplicationRateFromBiDups
independentReplicationRateFromTriDups
pSameUmiInIndependentBiDup
pSameAlleleWhenMismatchingUmi
independentReplicationRateFromUmi
replicationRateFromReplicateSets

InsertSizeMetrics

Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insert_size_metrics". In addition the insert size distribution is plotted to a file with the extension ".insert_size_Histogram.pdf".

FieldDescription
MEDIAN_INSERT_SIZEThe MEDIAN insert size of all paired end reads where both ends mapped to the same chromosome.
MEDIAN_ABSOLUTE_DEVIATIONThe median absolute deviation of the distribution. If the distribution is essentially normal then the standard deviation can be estimated as ~1.4826 * MAD.
MIN_INSERT_SIZEThe minimum measured insert size. This is usually 1 and not very useful as it is likely artifactual.
MAX_INSERT_SIZEThe maximum measured insert size by alignment. This is usually very high, representing either an artifact or possibly the presence of a structural re-arrangement.
MEAN_INSERT_SIZEThe mean insert size of the "core" of the distribution. Artefactual outliers in the distribution often cause calculation of nonsensical mean and stdev values. To avoid this the distribution is first trimmed to a "core" distribution of +/- N median absolute deviations around the median insert size. By default N=10, but this is configurable.
STANDARD_DEVIATIONStandard deviation of insert sizes over the "core" of the distribution.
READ_PAIRSThe total number of read pairs that were examined in the entire distribution.
PAIR_ORIENTATIONThe pair orientation of the reads in this data category.
WIDTH_OF_10_PERCENTThe "width" of the bins, centered around the median, that encompass 10% of all read pairs.
WIDTH_OF_20_PERCENTThe "width" of the bins, centered around the median, that encompass 20% of all read pairs.
WIDTH_OF_30_PERCENTThe "width" of the bins, centered around the median, that encompass 30% of all read pairs.
WIDTH_OF_40_PERCENTThe "width" of the bins, centered around the median, that encompass 40% of all read pairs.
WIDTH_OF_50_PERCENTThe "width" of the bins, centered around the median, that encompass 50% of all read pairs.
WIDTH_OF_60_PERCENTThe "width" of the bins, centered around the median, that encompass 60% of all read pairs.
WIDTH_OF_70_PERCENTThe "width" of the bins, centered around the median, that encompass 70% of all read pairs. This metric divided by 2 should approximate the standard deviation when the insert size distribution is a normal distribution.
WIDTH_OF_80_PERCENTThe "width" of the bins, centered around the median, that encompass 80% of all read pairs.
WIDTH_OF_90_PERCENTThe "width" of the bins, centered around the median, that encompass 90% of all read pairs.
WIDTH_OF_99_PERCENTThe "width" of the bins, centered around the median, that encompass 99% of all read pairs.
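
To make the "core" trimming and the WIDTH_OF_N_PERCENT fields more concrete, here is a minimal sketch working on a plain list of insert sizes. It is not Picard's implementation: the function names, the use of the population standard deviation, and the `2 * d + 1` definition of a symmetric window are assumptions made for illustration.

```python
import math
import statistics


def core_insert_size_stats(insert_sizes, n_mads=10):
    """Trim to +/- n_mads median absolute deviations, then take mean and SD."""
    median = statistics.median(insert_sizes)
    mad = statistics.median([abs(x - median) for x in insert_sizes])
    core = [x for x in insert_sizes if abs(x - median) <= n_mads * mad]
    return {
        "median": median,
        "mad": mad,
        "mean": statistics.mean(core),
        "stdev": statistics.pstdev(core),  # assumption: population SD
        "stdev_from_mad": 1.4826 * mad,    # valid if roughly normal
    }


def width_of_percent(insert_sizes, fraction):
    """Width of the smallest window centered on the median holding `fraction` of pairs."""
    median = statistics.median(insert_sizes)
    deviations = sorted(abs(x - median) for x in insert_sizes)
    needed = math.ceil(fraction * len(deviations))
    half_width = deviations[max(needed, 1) - 1]
    return 2 * half_width + 1


sizes = [190, 195, 198, 200, 200, 202, 205, 210, 5000]
stats = core_insert_size_stats(sizes)
width_50 = width_of_percent(sizes, 0.5)
```

In this toy data the single 5000 bp outlier falls outside the core, so it does not distort the mean or standard deviation.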

JumpingLibraryMetrics

High level metrics about the presence of outward- and inward-facing pairs within a SAM file generated with a jumping library, produced by the CollectJumpingLibraryMetrics program and usually stored in a file with the extension ".jump_metrics".

FieldDescription
JUMP_PAIRSThe number of outward-facing pairs in the SAM file
JUMP_DUPLICATE_PAIRSThe number of outward-facing pairs that are duplicates
JUMP_DUPLICATE_PCTThe fraction of outward-facing pairs that are marked as duplicates
JUMP_LIBRARY_SIZEThe estimated library size for outward-facing pairs
JUMP_MEAN_INSERT_SIZEThe mean insert size for outward-facing pairs
JUMP_STDEV_INSERT_SIZEThe standard deviation on the insert size for outward-facing pairs
NONJUMP_PAIRSThe number of inward-facing pairs in the SAM file
NONJUMP_DUPLICATE_PAIRSThe number of inward-facing pairs that are duplicates
NONJUMP_DUPLICATE_PCTThe fraction of inward-facing pairs that are marked as duplicates
NONJUMP_LIBRARY_SIZEThe estimated library size for inward-facing pairs
NONJUMP_MEAN_INSERT_SIZEThe mean insert size for inward-facing pairs
NONJUMP_STDEV_INSERT_SIZEThe standard deviation on the insert size for inward-facing pairs
CHIMERIC_PAIRSThe number of pairs where either (a) the ends fall on different chromosomes or (b) the insert size is greater than the maximum of 100000 or 2 times the mode of the insert size for outward-facing pairs.
FRAGMENTSThe number of fragments in the SAM file
PCT_JUMPSThe number of outward-facing pairs expressed as a fraction of the total of all outward facing pairs, inward-facing pairs, and chimeric pairs.
PCT_NONJUMPSThe number of inward-facing pairs expressed as a fraction of the total of all outward facing pairs, inward-facing pairs, and chimeric pairs.
PCT_CHIMERASThe number of chimeric pairs expressed as a fraction of the total of all outward facing pairs, inward-facing pairs, and chimeric pairs.
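
The chimeric-pair rule and the three PCT_ fields can be sketched as below. The function and parameter names are hypothetical, and the sketch assumes the mode of the outward-facing insert size has already been computed elsewhere.

```python
import math


def is_chimeric(same_chromosome, insert_size, jump_mode_insert_size):
    """Apply rule (a) or (b) from the CHIMERIC_PAIRS description."""
    threshold = max(100000, 2 * jump_mode_insert_size)
    return (not same_chromosome) or abs(insert_size) > threshold


def pair_fractions(jump_pairs, nonjump_pairs, chimeric_pairs):
    """Express each pair class as a fraction of all three classes combined."""
    total = jump_pairs + nonjump_pairs + chimeric_pairs
    return {
        "pct_jumps": jump_pairs / total,
        "pct_nonjumps": nonjump_pairs / total,
        "pct_chimeras": chimeric_pairs / total,
    }


fractions = pair_fractions(jump_pairs=200, nonjump_pairs=700, chimeric_pairs=100)
```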

MendelianViolationMetrics

Describes the type and number of mendelian violations found within a Trio.

FieldDescription
FAMILY_IDThe family ID assigned to the trio for which these metrics are calculated.
MOTHERThe ID of the mother within the trio.
FATHERThe ID of the father within the trio.
OFFSPRINGThe ID of the offspring within the trio.
OFFSPRING_SEXThe sex of the offspring.
NUM_VARIANT_SITESThe number of biallelic, SNP sites at which all relevant samples exceeded the minimum genotype quality and depth and at least one of the samples was variant.
NUM_DIPLOID_DENOVOThe number of diploid sites at which a potential de-novo mutation was observed (i.e. both parents are hom-ref, offspring is not hom-ref).
NUM_HOMVAR_HOMVAR_HETThe number of sites at which both parents are homozygous for a non-reference allele and the offspring is heterozygous.
NUM_HOMREF_HOMVAR_HOMThe number of sites at which one parent is homozygous reference, the other homozygous variant and the offspring is homozygous.
NUM_HOM_HET_HOMThe number of sites at which one parent is homozygous, the other is heterozygous and the offspring is the alternative homozygote.
NUM_HAPLOID_DENOVOThe number of sites at which the offspring is haploid, the parent is homozygous reference and the offspring is non-reference.
NUM_HAPLOID_OTHERThe number of sites at which the offspring is haploid and exhibits a reference allele that is not present in the parent.
NUM_OTHERThe number of otherwise unclassified events.
TOTAL_MENDELIAN_VIOLATIONSThe total of all mendelian violations observed.
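
The diploid categories above can be illustrated with a small classifier over biallelic genotypes. This is a sketch under stated assumptions rather than Picard's logic: genotypes are tuples of 0 (reference) and 1 (alternate), haploid sites and the genotype quality and depth filters are ignored, and the function name `classify_trio` is hypothetical.

```python
def classify_trio(mother, father, offspring):
    """Return the violation category for a diploid biallelic site, or None."""
    m, f, c = (tuple(sorted(g)) for g in (mother, father, offspring))
    hom_ref, hom_var = (0, 0), (1, 1)

    def is_hom(g):
        return g[0] == g[1]

    if m == hom_ref and f == hom_ref and c != hom_ref:
        return "NUM_DIPLOID_DENOVO"
    if m == hom_var and f == hom_var and not is_hom(c):
        return "NUM_HOMVAR_HOMVAR_HET"
    if {m, f} == {hom_ref, hom_var} and is_hom(c):
        return "NUM_HOMREF_HOMVAR_HOM"
    if is_hom(m) != is_hom(f) and is_hom(c):
        hom_parent = m if is_hom(m) else f
        if c != hom_parent:
            return "NUM_HOM_HET_HOM"  # offspring is the alternative homozygote
    # Mendelian-consistent if one allele can come from each parent.
    if any(tuple(sorted((a, b))) == c for a in m for b in f):
        return None
    return "NUM_OTHER"
```

For example, two homozygous-reference parents with a heterozygous offspring fall into NUM_DIPLOID_DENOVO, while two heterozygous parents with a homozygous-variant offspring are consistent and return None.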

MergeableMetricBase

An extension of MetricBase that knows how to merge-by-adding fields that are appropriately annotated. It also provides an interface for calculating derived fields (and an annotation that informs that said fields are derived). Finally, it also allows for an annotation that suggests that a field will be used as an ID and thus merging will simply require that these fields are equal. Merge-by-adding is only enabled for the following types: int, Integer, float, Float, double, Double, short, Short, long, Long, byte, Byte. Overflow will be detected (for the short and byte types) and an exception thrown.

FieldDescription

MultilevelMetrics

FieldDescription
SAMPLEThe sample to which these metrics apply. If null, it means they apply to all reads in the file.
LIBRARYThe library to which these metrics apply. If null, it means that the metrics were accumulated at the sample level.
READ_GROUPThe read group to which these metrics apply. If null, it means that the metrics were accumulated at the library or sample level.

RnaSeqMetrics

Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics program and usually stored in a file with the extension ".rna_metrics".

FieldDescription
PF_BASESThe total number of PF bases including non-aligned reads.
PF_ALIGNED_BASESThe total number of aligned PF bases. Non-primary alignments are not counted. Bases in aligned reads that do not correspond to reference (e.g. soft clips, insertions) are not counted.
RIBOSOMAL_BASESNumber of bases in primary alignments that align to ribosomal sequence.
CODING_BASESNumber of bases in primary alignments that align to a non-UTR coding base for some gene, and not ribosomal sequence.
UTR_BASESNumber of bases in primary alignments that align to a UTR base for some gene, and not a coding base.
INTRONIC_BASESNumber of bases in primary alignments that align to an intronic base for some gene, and not a coding or UTR base.
INTERGENIC_BASESNumber of bases in primary alignments that do not align to any gene.
IGNORED_READSNumber of primary alignments that are mapped to a sequence specified on command-line as IGNORED_SEQUENCE. These are not counted in PF_ALIGNED_BASES, CORRECT_STRAND_READS, INCORRECT_STRAND_READS, or any of the base-counting metrics. These reads are counted in PF_BASES.
CORRECT_STRAND_READSNumber of aligned reads that are mapped to the correct strand. 0 if library is not strand-specific.
INCORRECT_STRAND_READSNumber of aligned reads that are mapped to the incorrect strand. 0 if library is not strand-specific.
NUM_R1_TRANSCRIPT_STRAND_READSThe number of reads that support the model where R1 is on the strand of transcription and R2 is on the opposite strand.
NUM_R2_TRANSCRIPT_STRAND_READSThe number of reads that support the model where R2 is on the strand of transcription and R1 is on the opposite strand.
NUM_UNEXPLAINED_READSThe number of reads for which the transcription strand model could not be inferred.
PCT_R1_TRANSCRIPT_STRAND_READSThe fraction of reads that support the model where R1 is on the strand of transcription and R2 is on the opposite strand. For unpaired reads, it is the fraction of reads that are on the transcription strand (out of all the reads).
PCT_R2_TRANSCRIPT_STRAND_READSThe fraction of reads that support the model where R2 is on the strand of transcription and R1 is on the opposite strand. For unpaired reads, it is the fraction of reads that are on the strand opposite to the transcription strand (out of all the reads).
PCT_RIBOSOMAL_BASESFraction of PF_ALIGNED_BASES that mapped to regions encoding ribosomal RNA, RIBOSOMAL_BASES/PF_ALIGNED_BASES
PCT_CODING_BASESFraction of PF_ALIGNED_BASES that mapped to protein coding regions of genes, CODING_BASES/PF_ALIGNED_BASES
PCT_UTR_BASESFraction of PF_ALIGNED_BASES that mapped to untranslated regions (UTR) of genes, UTR_BASES/PF_ALIGNED_BASES
PCT_INTRONIC_BASESFraction of PF_ALIGNED_BASES that correspond to gene introns, INTRONIC_BASES/PF_ALIGNED_BASES
PCT_INTERGENIC_BASESFraction of PF_ALIGNED_BASES that mapped to intergenic regions of genomic DNA, INTERGENIC_BASES/PF_ALIGNED_BASES
PCT_MRNA_BASESFraction of PF_ALIGNED_BASES that mapped to either UTR or coding regions of mRNA transcripts, PCT_UTR_BASES + PCT_CODING_BASES
PCT_USABLE_BASESThe fraction of bases mapping to mRNA divided by the total number of PF bases, (CODING_BASES + UTR_BASES)/PF_BASES.
PCT_CORRECT_STRAND_READSFraction of reads corresponding to mRNA transcripts which map to the correct strand of a reference genome = CORRECT_STRAND_READS/(CORRECT_STRAND_READS + INCORRECT_STRAND_READS). 0 if library is not strand-specific.
MEDIAN_CV_COVERAGEThe median coefficient of variation (CV) or stdev/mean for coverage values of the 1000 most highly expressed transcripts. Ideal value = 0.
MEDIAN_5PRIME_BIASThe median 5 prime bias of the 1000 most highly expressed transcripts. The 5 prime bias is calculated per transcript as: mean coverage of the 5 prime-most 100 bases divided by the mean coverage of the whole transcript.
MEDIAN_3PRIME_BIASThe median 3 prime bias of the 1000 most highly expressed transcripts, where 3 prime bias is calculated per transcript as: mean coverage of the 3 prime-most 100 bases divided by the mean coverage of the whole transcript.
MEDIAN_5PRIME_TO_3PRIME_BIASThe ratio of coverage at the 5 prime end to the 3 prime end based on the 1000 most highly expressed transcripts.
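
The per-transcript quantities behind the coverage and bias fields can be sketched as follows, assuming a per-base coverage list oriented 5' to 3'. The function names are illustrative, and the use of the population standard deviation for the CV is an assumption rather than something stated above.

```python
import math
import statistics


def end_bias(coverage, window=100):
    """Return (5' bias, 3' bias): mean end coverage over mean transcript coverage."""
    mean_all = sum(coverage) / len(coverage)
    five_prime = sum(coverage[:window]) / len(coverage[:window])
    three_prime = sum(coverage[-window:]) / len(coverage[-window:])
    return five_prime / mean_all, three_prime / mean_all


def coverage_cv(coverage):
    """Coefficient of variation (stdev / mean) of per-base coverage."""
    return statistics.pstdev(coverage) / statistics.mean(coverage)


def pct_usable_bases(coding_bases, utr_bases, pf_bases):
    """(CODING_BASES + UTR_BASES) / PF_BASES, as defined for PCT_USABLE_BASES."""
    return (coding_bases + utr_bases) / pf_bases


# A toy 500-base transcript whose coverage rises toward the 3' end.
coverage = [10] * 100 + [20] * 300 + [30] * 100
bias_5p, bias_3p = end_bias(coverage)
```

The table's median fields would then be taken over such per-transcript values for the 1000 most highly expressed transcripts.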

RrbsCpgDetailMetrics

Holds information about CpG sites encountered for RRBS processing QC

FieldDescription
SEQUENCE_NAMESequence the CpG is seen in
POSITIONPosition within the sequence of the CpG site
TOTAL_SITESNumber of times this CpG site was encountered
CONVERTED_SITESNumber of times this CpG site was converted (TG for + strand, CA for - strand)
PCT_CONVERTEDCONVERTED_SITES / TOTAL_SITES (fraction)

RrbsSummaryMetrics

Holds summary statistics from RRBS processing QC

FieldDescription
READS_ALIGNEDNumber of mapped reads processed
NON_CPG_BASESNumber of times a non-CpG cytosine was encountered
NON_CPG_CONVERTED_BASESNumber of times a non-CpG cytosine was converted (C->T for +, G->A for -)
PCT_NON_CPG_BASES_CONVERTEDNON_CPG_CONVERTED_BASES / NON_CPG_BASES (fraction)
CPG_BASES_SEENNumber of CpG sites encountered
CPG_BASES_CONVERTEDNumber of CpG sites that were converted (TG for +, CA for -)
PCT_CPG_BASES_CONVERTEDCPG_BASES_CONVERTED / CPG_BASES_SEEN (fraction)
MEAN_CPG_COVERAGEMean coverage of CpG sites
MEDIAN_CPG_COVERAGEMedian coverage of CpG sites
READS_WITH_NO_CPGNumber of reads discarded for having no CpG sites
READS_IGNORED_SHORTNumber of reads discarded due to being too short
READS_IGNORED_MISMATCHESNumber of reads discarded for exceeding the mismatch threshold

SequencingArtifactMetrics.BaitBiasDetailMetrics

Bait bias artifacts broken down by context.

FieldDescription
SAMPLE_ALIASThe name of the sample being assayed.
LIBRARYThe name of the library being assayed.
REF_BASEThe (upper-case) original base on the reference strand.
ALT_BASEThe (upper-case) alternative base that is called as a result of DNA damage.
CONTEXTThe sequence context to which the analysis is constrained.
FWD_CXT_REF_BASESThe number of REF_BASE:REF_BASE alignments at sites with the given reference context.
FWD_CXT_ALT_BASESThe number of REF_BASE:ALT_BASE alignments at sites with the given reference context.
REV_CXT_REF_BASESThe number of ~REF_BASE:~REF_BASE alignments at sites complementary to the given reference context.
REV_CXT_ALT_BASESThe number of ~REF_BASE:~ALT_BASE alignments at sites complementary to the given reference context.
FWD_ERROR_RATEThe substitution rate of REF_BASE:ALT_BASE, calculated as max(1e-10, FWD_CXT_ALT_BASES / (FWD_CXT_ALT_BASES + FWD_CXT_REF_BASES)).
REV_ERROR_RATEThe substitution rate of ~REF_BASE:~ALT_BASE, calculated as max(1e-10, REV_CXT_ALT_BASES / (REV_CXT_ALT_BASES + REV_CXT_REF_BASES)).
ERROR_RATEThe bait bias error rate, calculated as max(1e-10, FWD_ERROR_RATE - REV_ERROR_RATE).
QSCOREThe Phred-scaled quality score of the artifact, calculated as -10 * log10(ERROR_RATE).
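
The error-rate and Q-score formulas in this table chain together as shown in the sketch below. The function name and the handling of empty denominators (treated as a zero rate before the 1e-10 floor) are assumptions made for the example.

```python
import math

FLOOR = 1e-10  # minimum rate used in the formulas above


def bait_bias_metrics(fwd_ref, fwd_alt, rev_ref, rev_alt):
    """Compute FWD_ERROR_RATE, REV_ERROR_RATE, ERROR_RATE and QSCORE."""
    def rate(alt, ref):
        total = alt + ref
        return max(FLOOR, alt / total) if total else FLOOR

    fwd_error = rate(fwd_alt, fwd_ref)
    rev_error = rate(rev_alt, rev_ref)
    error = max(FLOOR, fwd_error - rev_error)
    return {
        "fwd_error_rate": fwd_error,
        "rev_error_rate": rev_error,
        "error_rate": error,
        "qscore": -10 * math.log10(error),
    }


biased = bait_bias_metrics(fwd_ref=9990, fwd_alt=10, rev_ref=9999, rev_alt=1)
unbiased = bait_bias_metrics(fwd_ref=9990, fwd_alt=10, rev_ref=9990, rev_alt=10)
```

Because of the 1e-10 floor, a context with no excess forward-strand substitutions reports the maximum Q-score of 100 rather than an undefined value.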

SequencingArtifactMetrics.BaitBiasSummaryMetrics

Summary analysis of a single bait bias artifact, also known as a reference bias artifact. These artifacts occur during or after the target selection step, and correlate with substitution rates that are "biased", or higher for sites having one base on the reference/positive strand relative to sites having the complementary base on that strand. For example, a G>T artifact during the target selection step might result in a higher G>T / C>A substitution rate at sites with a G on the positive strand (and C on the negative), relative to sites with the flip (C positive / G negative). This is known as the "G-Ref" artifact.

FieldDescription
SAMPLE_ALIASThe name of the sample being assayed.
LIBRARYThe name of the library being assayed.
REF_BASEThe (upper-case) original base on the reference strand.
ALT_BASEThe (upper-case) alternative base that is called as a result of DNA damage.
TOTAL_QSCOREThe total Phred-scaled Q-score for this artifact. A lower Q-score means a higher probability that a REF_BASE:ALT_BASE observation randomly picked from the data will be due to this artifact, rather than a true variant.
WORST_CXTThe sequence context (reference bases surrounding the locus of interest) having the lowest Q-score among all contexts for this artifact.
WORST_CXT_QSCOREThe Q-score for the worst context.
WORST_PRE_CXTThe pre-context (reference bases leading up to the locus of interest) with the lowest Q-score.
WORST_PRE_CXT_QSCOREThe Q-score for the worst pre-context.
WORST_POST_CXTThe post-context (reference bases trailing after the locus of interest) with the lowest Q-score.
WORST_POST_CXT_QSCOREThe Q-score for the worst post-context.
ARTIFACT_NAMEA "nickname" of this artifact, if it is a known error mode.

SequencingArtifactMetrics.PreAdapterDetailMetrics

Pre-adapter artifacts broken down by context.

FieldDescription
SAMPLE_ALIASThe name of the sample being assayed.
LIBRARYThe name of the library being assayed.
REF_BASEThe (upper-case) original base on the reference strand.
ALT_BASEThe (upper-case) alternative base that is called as a result of DNA damage.
CONTEXTThe sequence context to which the analysis is constrained.
PRO_REF_BASESThe number of REF_BASE:REF_BASE alignments having a read number and orientation that supports the presence of this artifact.
PRO_ALT_BASESThe number of REF_BASE:ALT_BASE alignments having a read number and orientation that supports the presence of this artifact.
CON_REF_BASESThe number of REF_BASE:REF_BASE alignments having a read number and orientation that refutes the presence of this artifact.
CON_ALT_BASESThe number of REF_BASE:ALT_BASE alignments having a read number and orientation that refutes the presence of this artifact.
ERROR_RATEThe estimated error rate due to this artifact. Calculated as max(1e-10, (PRO_ALT_BASES - CON_ALT_BASES) / (PRO_ALT_BASES + PRO_REF_BASES + CON_ALT_BASES + CON_REF_BASES)).
QSCOREThe Phred-scaled quality score of the artifact, calculated as -10 * log10(ERROR_RATE).
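
As a worked illustration of the ERROR_RATE and QSCORE formulas for pre-adapter artifacts, the sketch below plugs in made-up counts. The function name and the treatment of an empty context (returning the 1e-10 floor) are assumptions for the example.

```python
import math


def pre_adapter_metrics(pro_ref, pro_alt, con_ref, con_alt):
    """Compute ERROR_RATE and QSCORE from the PRO_/CON_ base counts."""
    total = pro_alt + pro_ref + con_alt + con_ref
    raw = (pro_alt - con_alt) / total if total else 0.0
    error = max(1e-10, raw)
    return {"error_rate": error, "qscore": -10 * math.log10(error)}


# Hypothetical counts: 50 supporting vs 10 refuting ALT observations.
result = pre_adapter_metrics(pro_ref=4950, pro_alt=50, con_ref=4990, con_alt=10)
```

Subtracting CON_ALT_BASES from PRO_ALT_BASES is what removes the contribution of true mutations and unrelated artifacts, which are expected to appear equally in both orientations.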

SequencingArtifactMetrics.PreAdapterSummaryMetrics

Summary analysis of a single pre-adapter artifact. These artifacts occur on the original template strand, before the addition of adapters, so they correlate with read number / orientation in a specific way. For example, the well-known "Oxo-G" artifact occurs when a G on the template strand is oxidized, giving it an affinity for binding to A rather than the usual C. Thus PCR will introduce apparent G>T substitutions in read 1 and C>A in read 2. In the resulting alignments, a given G>T or C>A observation could be one of three things: (1) a true mutation, (2) an OxoG artifact, or (3) some other kind of artifact. On average, we assume that 1 and 3 will not display this read number / orientation bias, so their contributions will cancel out in the calculation.

FieldDescription
SAMPLE_ALIASThe name of the sample being assayed.
LIBRARYThe name of the library being assayed.
REF_BASEThe (upper-case) original base on the reference strand.
ALT_BASEThe (upper-case) alternative base that is called as a result of DNA damage.
TOTAL_QSCOREThe total Phred-scaled Q-score for this artifact. A lower Q-score means a higher probability that a REF_BASE:ALT_BASE observation randomly picked from the data will be due to this artifact, rather than a true variant.
WORST_CXTThe sequence context (reference bases surrounding the locus of interest) having the lowest Q-score among all contexts for this artifact.
WORST_CXT_QSCOREThe Q-score for the worst context.
WORST_PRE_CXTThe pre-context (reference bases leading up to the locus of interest) with the lowest Q-score.
WORST_PRE_CXT_QSCOREThe Q-score for the worst pre-context.
WORST_POST_CXTThe post-context (reference bases trailing after the locus of interest) with the lowest Q-score.
WORST_POST_CXT_QSCOREThe Q-score for the worst post-context.
ARTIFACT_NAMEA "nickname" of this artifact, if it is a known error mode.

TargetedPcrMetrics

Metrics class for the analysis of reads obtained from targeted PCR experiments, e.g. the TruSeq Custom Amplicon (TSCA) kit (Illumina).

Field: Description
CUSTOM_AMPLICON_SET: The name of the amplicon set used in this metrics collection run.
GENOME_SIZE: The number of bases in the reference genome used for alignment.
AMPLICON_TERRITORY: The number of unique bases covered by the intervals of all amplicons in the amplicon set.
TARGET_TERRITORY: The number of unique bases covered by the intervals of all targets that should be covered.
TOTAL_READS: The total number of reads in the SAM or BAM file examined.
PF_READS: The total number of reads passing filter (PF), where the filter(s) can be platform/vendor quality controls.
PF_BASES: The total number of bases within the PF_READS of the SAM or BAM file to be examined.
PF_UNIQUE_READS: The number of PF_READS that were not marked as sample or optical duplicates.
PCT_PF_READS: The fraction of reads passing filter, PF_READS/TOTAL_READS.
PCT_PF_UQ_READS: The fraction of TOTAL_READS that are PF and not marked as duplicates, PF_UNIQUE_READS/TOTAL_READS.
PF_UQ_READS_ALIGNED: The total number of PF_UNIQUE_READS that align to the reference genome with mapping score > 0.
PF_SELECTED_PAIRS: The number of PF read pairs (used to calculate library size).
PF_SELECTED_UNIQUE_PAIRS: The number of unique PF read pairs observed (used to calculate library size).
PCT_PF_UQ_READS_ALIGNED: The fraction of PF_READS that are unique and align to the reference genome, PF_UQ_READS_ALIGNED/PF_READS.
PF_BASES_ALIGNED: The number of bases from PF_READS that align to the reference genome with mapping score > 0.
PF_UQ_BASES_ALIGNED: The number of bases from PF_UNIQUE_READS that align to the reference genome with mapping score > 0.
ON_AMPLICON_BASES: The number of PF_BASES_ALIGNED that mapped to an amplified region of the genome.
NEAR_AMPLICON_BASES: The number of PF_BASES_ALIGNED that mapped to within a fixed interval of an amplified region, but not on a baited region.
OFF_AMPLICON_BASES: The number of PF_BASES_ALIGNED that mapped neither on nor near an amplicon.
ON_TARGET_BASES: The number of PF_BASES_ALIGNED that mapped to a targeted region of the genome.
ON_TARGET_FROM_PAIR_BASES: The number of bases from PF_SELECTED_UNIQUE_PAIRS that mapped to a targeted region of the genome.
PCT_AMPLIFIED_BASES: The fraction of PF_BASES_ALIGNED that mapped to or near an amplicon, (ON_AMPLICON_BASES + NEAR_AMPLICON_BASES)/PF_BASES_ALIGNED.
PCT_OFF_AMPLICON: The fraction of PF_BASES_ALIGNED that mapped neither onto nor near an amplicon, OFF_AMPLICON_BASES/PF_BASES_ALIGNED.
ON_AMPLICON_VS_SELECTED: Of the bases that mapped on or near amplicons, the fraction that mapped directly onto an amplicon, ON_AMPLICON_BASES/(NEAR_AMPLICON_BASES + ON_AMPLICON_BASES).
MEAN_AMPLICON_COVERAGE: The mean read coverage of all amplicon regions in the experiment.
MEAN_TARGET_COVERAGE: The mean read coverage of all target regions in an experiment.
MEDIAN_TARGET_COVERAGE: The median coverage of reads that mapped to target regions of an experiment.
MAX_TARGET_COVERAGE: The maximum coverage of reads that mapped to target regions of an experiment.
FOLD_ENRICHMENT: The fold by which the amplicon region has been amplified above genomic background.
ZERO_CVG_TARGETS_PCT: The fraction of targets that did not reach coverage=1 over any base.
PCT_EXC_DUPE: The fraction of aligned bases that were filtered out because they were in reads marked as duplicates.
PCT_EXC_MAPQ: The fraction of aligned bases that were filtered out because they were in reads with low mapping quality.
PCT_EXC_BASEQ: The fraction of aligned bases that were filtered out because they were of low base quality.
PCT_EXC_OVERLAP: The fraction of aligned bases that were filtered out because they were the second observation from an insert with overlapping reads.
PCT_EXC_OFF_TARGET: The fraction of bases that were filtered out because they did not map to a base within a target region.
FOLD_80_BASE_PENALTY: The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets.
PCT_TARGET_BASES_1X: The fraction of all target bases achieving 1X or greater coverage depth.
PCT_TARGET_BASES_2X: The fraction of all target bases achieving 2X or greater coverage depth.
PCT_TARGET_BASES_10X: The fraction of all target bases achieving 10X or greater coverage depth.
PCT_TARGET_BASES_20X: The fraction of all target bases achieving 20X or greater coverage depth.
PCT_TARGET_BASES_30X: The fraction of all target bases achieving 30X or greater coverage depth.
AT_DROPOUT: A measure of how regions with low GC content (<= 50%) are undercovered relative to mean coverage. After binning the GC content [0..50], we calculate a = fraction of target territory, and b = fraction of aligned reads aligned to these targets for each bin. AT DROPOUT is then abs(sum(a-b when a-b < 0)). For example, if the AT_DROPOUT value is 5%, this implies that 5% of total reads that should have mapped to GC<=50% regions mapped elsewhere.
GC_DROPOUT: A measure of how regions with high GC content (>= 50%) are undercovered relative to mean coverage. For each GC bin [50..100], we calculate a = fraction of target territory, and b = fraction of aligned reads aligned to these targets. GC DROPOUT is then abs(sum(a-b when a-b < 0)). For example, if the GC_DROPOUT value is 5%, this implies that 5% of total reads that should have mapped to GC>=50% regions mapped elsewhere.
HET_SNP_SENSITIVITY: The theoretical HET SNP sensitivity.
HET_SNP_Q: The Q Score of the theoretical HET SNP sensitivity.
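
The three amplicon ratio fields are defined by explicit formulas over the base counts, so they can be recomputed from a parsed metrics row as a sanity check. The sketch below follows the formulas given for PCT_AMPLIFIED_BASES, PCT_OFF_AMPLICON and ON_AMPLICON_VS_SELECTED; the function name and the choice to return 0.0 for a zero denominator are assumptions made for illustration, not Picard behavior.

```python
def amplicon_fractions(on_amplicon_bases, near_amplicon_bases,
                       off_amplicon_bases, pf_bases_aligned):
    """Recompute the amplicon ratio metrics from their base counts."""

    def safe_div(numerator, denominator):
        # Assumption: report 0.0 rather than failing when nothing aligned.
        return numerator / denominator if denominator else 0.0

    # Bases that mapped on or near an amplicon ("selected" bases).
    selected = on_amplicon_bases + near_amplicon_bases
    return {
        "PCT_AMPLIFIED_BASES": safe_div(selected, pf_bases_aligned),
        "PCT_OFF_AMPLICON": safe_div(off_amplicon_bases, pf_bases_aligned),
        "ON_AMPLICON_VS_SELECTED": safe_div(on_amplicon_bases, selected),
    }
```

For example, `amplicon_fractions(800, 150, 50, 1000)` describes a run in which 95% of aligned PF bases fell on or near an amplicon and 5% fell elsewhere.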

UmiMetrics

Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords using the UmiAwareDuplicateSetIterator.

Field: Description
MEAN_UMI_LENGTH: Number of bases in each UMI.
OBSERVED_UNIQUE_UMIS: Number of different UMI sequences observed.
INFERRED_UNIQUE_UMIS: Number of different inferred UMI sequences derived.
OBSERVED_BASE_ERRORS: Number of errors inferred by comparing the observed and inferred UMIs.
DUPLICATE_SETS_IGNORING_UMI: Number of duplicate sets found before taking UMIs into account.
DUPLICATE_SETS_WITH_UMI: Number of duplicate sets found after taking UMIs into account.
OBSERVED_UMI_ENTROPY: Entropy (in base 4) of the observed UMI sequences, indicating the effective number of bases in the UMIs. If this is significantly smaller than UMI_LENGTH, it indicates that the UMIs are not distributed uniformly.
INFERRED_UMI_ENTROPY: Entropy (in base 4) of the inferred UMI sequences, indicating the effective number of bases in the inferred UMIs. If this is significantly smaller than UMI_LENGTH, it indicates that the UMIs are not distributed uniformly.
UMI_BASE_QUALITIES: Estimate of the Phred-scaled quality scores for UMIs.
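
To illustrate why base-4 entropy can be read as an "effective number of bases", here is a minimal sketch that computes Shannon entropy with a base-4 logarithm over the frequencies of a list of UMI sequences. The function name is hypothetical, and treating each distinct UMI sequence as one outcome weighted by how often it is seen is an assumption about how the metric is computed, not a description of Picard's code.

```python
import math
from collections import Counter


def umi_entropy_base4(umis):
    """Shannon entropy (log base 4) of the observed UMI frequencies."""
    counts = Counter(umis)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        # Each distinct UMI contributes -p * log4(p).
        entropy -= p * math.log(p, 4)
    return entropy
```

If every possible 2-base UMI is seen equally often, the entropy is 2, matching the UMI length. If only a few UMIs dominate, the entropy falls below the length, which is the non-uniformity the field descriptions warn about.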