Illumina Genotyping Array Overview

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| --- | --- | --- | --- |
| Version 1.12.17 | March, 2024 | Elizabeth Kiernan | Please file GitHub issues in warp or contact the WARP team |

The Illumina Genotyping Array Pipeline

Introduction to the Illumina Genotyping Array pipeline

The Illumina Genotyping Array pipeline was developed by the Broad Pipelines team to process Illumina genotyping array data in the form of IDAT files. Overall, the pipeline performs gender-specific genotyping, sample contamination detection, and summary metric collection. It optionally performs rare variant calling and genotype concordance, creates a fingerprint VCF that can be used for sample verification in parallel processes, and evaluates an existing sample fingerprint to confirm sample identity. The pipeline outputs annotated VCFs, index files, and summary metrics.

Set-up

Workflow installation and requirements

The Illumina Genotyping Array workflow is written in the Workflow Description Language (WDL) and can be deployed using Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms.

For the latest workflow version and release notes, please see the Illumina Genotyping Array changelog (IlluminaGenotypingArray.changelog.md).

The latest release of the workflow, example data, and dependencies are available from the WARP releases page. To discover and search releases, use the WARP command-line tool Wreleaser.

Inputs

The workflow requires that each input is specified in a JSON file. All sample and reference files should be stored in cloud storage, such as a Google Bucket, and the path to the file location should be specified in the JSON.

Some tasks, such as rare variant calling, fingerprinting, and genotype concordance, are optional.

An example of all workflow inputs (both required and optional) can be found in the example.json file. The tables below describe the workflow inputs, their formats, and whether they are required or optional.
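As a minimal sketch of how such an inputs file might be assembled, the Python below builds the five sample data inputs described in the next table and serializes them to JSON. The `IlluminaGenotypingArray.` key prefix, the bucket paths, and the sample values are assumptions made for illustration; example.json remains the authoritative source for key names.

```python
import json

# Hypothetical workflow namespace; check example.json for the real key prefix.
PREFIX = "IlluminaGenotypingArray."


def build_sample_inputs(chip_well_barcode, sample_alias, red_idat, green_idat, reported_gender):
    """Return the sample data inputs as a dict keyed for a workflow inputs JSON."""
    values = {
        "chip_well_barcode": chip_well_barcode,
        "sample_alias": sample_alias,
        "red_idat_cloud_path": red_idat,
        "green_idat_cloud_path": green_idat,
        "reported_gender": reported_gender,
    }
    return {PREFIX + name: value for name, value in values.items()}


# Placeholder values; the bucket and file names are invented for this example.
inputs = build_sample_inputs(
    "7991775143_R01C01",
    "example_sample",
    "gs://example-bucket/7991775143_R01C01_Red.idat",
    "gs://example-bucket/7991775143_R01C01_Grn.idat",
    "Female",
)
inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```

Chip metadata, reference, and optional inputs would be added to the same dictionary before writing it out.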

Sample data inputs

| Input name | Description | Input format |
| --- | --- | --- |
| chip_well_barcode | Unique identifier of the array chip section on which sample was run. Can be a concatenation of chip barcode and sample coordinates (Ex: 7991775143_R01C01). This is a prefix for output files. | String |
| sample_alias | Name of sample run on barcode-specific chip segment. | String |
| red_idat_cloud_path | Cloud path to the red IDAT file. | String |
| green_idat_cloud_path | Cloud path to the green IDAT file. | String |
| reported_gender | Reported sample sex/gender. | String |
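The chip_well_barcode convention can be illustrated with a short sketch. It assumes the underscore separator seen in the example value 7991775143_R01C01; the helper names are hypothetical and are not part of the workflow.

```python
def make_chip_well_barcode(chip_barcode, coordinates):
    """Join a chip barcode and sample coordinates, e.g. '7991775143' + 'R01C01'."""
    return f"{chip_barcode}_{coordinates}"


def split_chip_well_barcode(chip_well_barcode):
    """Split on the first underscore; assumes the chip barcode itself has none."""
    chip_barcode, _, coordinates = chip_well_barcode.partition("_")
    return chip_barcode, coordinates


barcode = make_chip_well_barcode("7991775143", "R01C01")
print(barcode)
print(split_chip_well_barcode(barcode))
```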

Chip metadata inputs

| Input name | Description | Required or optional | Input format |
| --- | --- | --- | --- |
| bead_pool_manifest_file | Cloud path to the Illumina array-specific bead pool manifest (bpm file). | Required | String |
| extended_chip_manifest_file | Cloud path to an extended CSV-formatted Illumina array-specific chip manifest. | Required | String |
| cluster_file | Cloud path to the Illumina array-specific cluster EGT file. | Required | String |
| gender_cluster_file | Cloud path to the Illumina array-specific gender cluster EGT file. | Optional | String |
| zcall_thresholds_file | Cloud path to a thresholds file to be used with the rare variant calling tool zCall. | Optional | String |

Reference data inputs

| Input name | Description | Required or optional | Input format |
| --- | --- | --- | --- |
| ref_fasta | Cloud path to the reference FASTA (only validated for Hg19; view the file in the public Google repository). | Required | String |
| ref_fasta_index | Cloud path to the reference FASTA index file. | Required | String |
| ref_dict | Cloud path to the reference FASTA dictionary file. | Required | String |
| dbSNP_vcf | Cloud path to the dbSNP VCF, used for metrics collection. | Required | String |
| dbSNP_vcf_index | Cloud path of the dbSNP VCF index file. | Required | String |

Fingerprinting inputs

| Input name | Description | Required or optional | Input format |
| --- | --- | --- | --- |
| variant_rsids_file | Cloud path to file containing the rsids list for fingerprint sites (used for creating a new fingerprint VCF). | Optional | String |
| fingerprint_genotypes_vcf_file | Cloud path to a 'fingerprint' VCF (containing a set of genotypes - typically generated on an orthogonal genotyping platform). Used to confirm sample identity. | Optional | String |
| fingerprint_genotypes_vcf_index_file | Cloud path to 'fingerprint' VCF index file. Used to confirm sample identity. | Optional | String |
| haplotype_database_file | Cloud path to a file containing the haplotype block information for the 'fingerprint' VCF sites. Used to confirm sample identity. | Optional | String |

Genotype concordance inputs

| Input name | Description | Required or optional | Input format |
| --- | --- | --- | --- |
| control_sample_vcf_file | Cloud path to a VCF containing the genotypes from a control sample. | Optional | String |
| control_sample_vcf_index_file | Cloud path to the control sample VCF index file. | Optional | String |
| control_sample_intervals_file | Cloud path to the interval list for the control sample VCF. | Optional | String |
| control_sample_name | Name of the control sample listed in the control_sample_vcf_file. | Optional | String |
| genotype_concordance_threshold | Minimal numeric value for sample to have a passing genotype concordance. Default is hardcoded in the workflow to 0.95. | Optional | Value |

Additional inputs

| Input name | Description | Required or optional | Input format |
| --- | --- | --- | --- |
| autocall_version | The version of autocall used for the autocall task. This is hardcoded to “3.0.0”. | Required | String |
| call_rate_threshold | Minimal numeric value for a sample to have a passing call rate. | Required | Value |
| minor_allele_frequency_file | Cloud path to a chip-specific text file containing locus-id to minor allele frequency. | Optional | String |
| contamination_controls_vcf | Cloud path to a VCF of samples run on this chip type to be used to supplement contamination calling. | Optional | String |
| subsampled_metrics_interval_list | Cloud path to a file containing a subset of sites for which the workflow generates metrics and outputs a VCF. | Optional | String |
| disk_size | Default disk (in GiB) for this workflow's cloud VMs. | Required | Value |
| premptible_tries | Number of times a task may be preempted by GCE before it is submitted to a non-preemptible VM. | Required | Value |
| analysis_version_number | Numeric value used to track the number of pipeline runs on the chip_well_barcode. | Required | Value |

Workflow Tools and Tasks

The Illumina Genotyping Array workflow imports a series of tasks from the IlluminaGenotypingArrayTasks.wdl. The following sections summarize the tasks and software tools the workflow uses, as well as the relevant inputs and outputs.

Tools

The following table summarizes the WDL tasks and software tools called by the Illumina Genotyping Array workflow. Note that a task and a tool sometimes have the same name. The task refers to the task listed in the IlluminaGenotypingArrayTasks.wdl.

| Task | Tool | Source |
| --- | --- | --- |
| Autocall | iaap-cli gencall | Illumina |
| GtcToVcf | GtcToVcf | Picard |
| BafRegress | BafRegress | https://genome.sph.umich.edu/wiki/File:BafRegress.tar.gz |
| VcfToAdpc | VcfToAdpc | Picard |
| VerifyIDIntensity | VerifyIDIntensity | https://github.com/gjun/verifyIDintensity |
| CreateVerifyIDIntensityContaminationMetricsFile | CreateVerifyIDIntensityContaminationMetricsFile | Picard |
| zCall | zCall | Python |
| MergePedIntoVcf | MergePedIntoVcf | Picard |
| CollectArraysVariantCallingMetrics | CollectArraysVariantCallingMetrics | Picard |
| SubsetArrayVCF | SubsetArrayVCF | GATK |
| CollectArraysVariantCallingMetrics | CollectArraysVariantCallingMetrics | Picard |
| SelectVariants | SelectVariants | GATK |
| CheckFingerprintTask | CheckFingerprint | Picard |
| VcfToIntervalList | VcfToIntervalList | Picard |
| GenotypeConcordance | GenotypeConcordance | Picard |

Task summary

Overall, the workflow imports tasks from IlluminaGenotypingArrayTasks.wdl to:

  1. Perform genotype calling
  2. Detect contamination
  3. Call rare variants (optional)
  4. Collect metrics
  5. Create a new fingerprint output (optional)
  6. Evaluate an existing fingerprint (optional)
  7. Evaluate genotype concordance with a control sample (optional)
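Which optional steps run depends on which optional inputs are supplied. The sketch below maps the input names from the tables above to the optional steps they enable, as described in the sections that follow. The function name and step labels are hypothetical, and the actual conditionals in the WDL may differ in detail.

```python
def plan_optional_steps(inputs):
    """List the optional steps enabled by the optional inputs that are present."""

    def has(*names):
        # An input counts as provided only if it is present and non-empty.
        return all(inputs.get(name) for name in names)

    steps = []
    if has("gender_cluster_file"):
        steps.append("gender-specific genotype calling")
    if has("zcall_thresholds_file"):
        steps.append("rare variant calling")
    if has("subsampled_metrics_interval_list"):
        steps.append("metric collection on a VCF subset")
    if has("variant_rsids_file"):
        steps.append("create a new fingerprint")
    if has("fingerprint_genotypes_vcf_file"):
        steps.append("evaluate an existing fingerprint")
    if has("control_sample_vcf_file", "control_sample_vcf_index_file", "control_sample_name"):
        steps.append("genotype concordance")
    return steps


# Placeholder paths, invented for this example.
print(plan_optional_steps({
    "zcall_thresholds_file": "gs://example-bucket/thresholds.txt",
    "variant_rsids_file": "gs://example-bucket/rsids.txt",
}))
```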

1. Genotype calling

Illumina BeadChip genotyping technology demarcates single-nucleotide variants and polymorphisms (SNPs) present on an individual's alleles by labeling them with red and green fluorescent signals. These signals are quantified and stored in red and green IDAT files, which the Illumina Genotyping Array workflow uses to call genotypes. To do this, the workflow requires the Illumina array-specific chip well barcode, an Illumina bead pool manifest file containing a list of all SNPs on the BeadChip, and an Illumina cluster file containing the fluorescent signal intensity cut-offs for a given genotype. These are used as inputs for the workflow's Autocall task, which executes the iaap-cli gencall tool, a command-line implementation of the Illumina genotyping software GenCall. This tool normalizes each SNP's red and green fluorescent signals and then clusters each SNP into one of three genotype possibilities (homozygous A, homozygous B, or heterozygous A/B). The task generates a genotype call file (GTC) as output. If the GTC file is empty, the workflow will end.
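To make the clustering idea concrete, here is a toy sketch that assigns a genotype from a pair of already normalized allele intensities. It is not the GenCall algorithm: the real tool uses per-SNP cluster positions from the cluster file, whereas the fixed thresholds (`low`, `high`, `min_r`) and the function names here are assumptions chosen purely for illustration.

```python
import math


def theta(a_intensity, b_intensity):
    """Polar angle scaled to [0, 1]: 0 is pure A signal, 1 is pure B signal."""
    return (2 / math.pi) * math.atan2(b_intensity, a_intensity)


def call_genotype(a_intensity, b_intensity, low=0.2, high=0.8, min_r=0.2):
    """Toy genotype call using fixed, hypothetical thresholds on theta."""
    # Total signal; too little intensity means no confident call can be made.
    r = a_intensity + b_intensity
    if r < min_r:
        return "NC"  # no call
    t = theta(a_intensity, b_intensity)
    if t < low:
        return "AA"
    if t > high:
        return "BB"
    return "AB"


for a, b in [(1.0, 0.05), (0.5, 0.5), (0.05, 1.0), (0.05, 0.05)]:
    print(a, b, call_genotype(a, b))
```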

Gender-specific genotype calling (optional)

If a gender_cluster_file is listed in the input JSON, the Illumina Genotyping workflow will repeat the Autocall task to get a more accurate gender call. It produces an additional GTC file that is only used to call gender and is not included in the final output VCF.

After genotyping, the workflow calls the GtcToVcf task, which runs the Picard tool GtcToVcf to convert the GTC into a VCF.

2. Contamination detection

Intra-species DNA contamination is a common problem for genotyping samples.

The Illumina Genotyping Array workflow has historically used two tools to detect contamination: BafRegress and VerifyIDIntensity. However, the use of VerifyIDIntensity is deprecated because it can overestimate contamination when run in single-sample mode (as it typically is).

BafRegress uses a regression model to detect and estimate sample contamination from the B allele frequency data of Illumina genotyping arrays. It requires a file formatted as an Illumina Final Report. The software is implemented in the workflow’s BafRegress task, which first creates the Illumina Final Report from the VCF generated by GtcToVcf and then runs the BafRegress tool. The output of the BafRegress task is a text file containing the estimated contamination along with associated metrics.
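The regression idea can be sketched in a few lines. This is a simplified illustration, not the BafRegress implementation: it assumes that, at sites called homozygous for the A allele, contaminating DNA raises the observed B allele frequency roughly in proportion to the population B allele frequency, so the fitted slope approximates the contamination fraction. The data are synthetic and the function name is hypothetical.

```python
def ols_slope_intercept(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic homozygous-A sites: background offset of 0.01 plus 3% simulated contamination.
population_b_freqs = [0.1, 0.2, 0.3, 0.4, 0.5]
observed_bafs = [0.01 + 0.03 * p for p in population_b_freqs]

slope, intercept = ols_slope_intercept(population_b_freqs, observed_bafs)
print(f"estimated contamination ~ {slope:.3f}, intercept ~ {intercept:.3f}")
```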

VerifyIDIntensity requires an 'adpc.bin' file (a binary file containing array intensity data that can be used with Illumina software) as input. The workflow first calls the VcfToAdpc task to convert the VCF output from genotype calling into an 'adpc.bin' file. Next, the VerifyIDIntensity task uses this input file to measure contamination. The CreateVerifyIDIntensityContaminationMetricsFile task then converts the VerifyIDIntensity output into a Picard-standard metrics file (chip_well_barcode.verifyidintensity_metrics), suitable for uploading to a metrics database.

3. Rare variant calling (optional)

After running default genotype processing with Autocall, the Illumina Genotyping Array workflow optionally uses the zCall task to improve calls on rare variants. To run this task, the workflow requires a zCall thresholds file (zcall_thresholds_file). If the file is provided, the zCall task outputs PLINK .ped and .map files. The MergePedIntoVcf task then merges these outputs into the VCF generated during genotype calling.

4. Metric collection

Quality metrics can be assessed using the genotyping output VCF (from the GtcToVcf task) or, alternatively, a subset of the VCF. The CollectArraysVariantCallingMetrics task calls the Picard tool CollectArraysVariantCallingMetrics to generate these metrics.

Metric collection on a VCF subset (optional)

If an input interval list is provided (the subsampled_metrics_interval_list file in the input JSON), the workflow runs the optional task SubsetArrayVCF, which uses the GATK tool SelectVariants to select variants within the specified intervals. The workflow then runs CollectArraysVariantCallingMetrics on the subsetted VCF to generate metrics.

5. Creating a new fingerprint output (optional)

DNA fingerprinting helps maintain sample identity and avoid sample swaps. The Illumina Genotyping Array workflow can optionally create a new fingerprint VCF output that can be used to verify sample identity if the sample is used for additional applications (downstream sequencing, etc.). To do this, the SelectFingerprintVariants task uses a reference SNP identifier file (rsid) to run the GATK tool SelectVariants. This tool selects variants in the genotyping output VCF based on the variants present in the rsids file. The task then outputs a new subsetted fingerprint VCF and index file.
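The selection step can be illustrated with a plain-Python sketch that keeps only VCF records whose ID column matches an rsid in the list. This is not how GATK SelectVariants is implemented; the function name and the two example records are invented for illustration.

```python
def select_by_rsid(vcf_lines, rsids):
    """Keep header lines plus records whose ID column contains a wanted rsid."""
    wanted = set(rsids)
    selected = []
    for line in vcf_lines:
        if line.startswith("#"):
            selected.append(line)  # always keep header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # ID is the third VCF column and may hold several IDs separated by ';'.
        if wanted.intersection(fields[2].split(";")):
            selected.append(line)
    return selected


# Synthetic VCF content for illustration only.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\trs111\tA\tG\t.\tPASS\t.",
    "1\t2000\trs222\tC\tT\t.\tPASS\t.",
]
fingerprint_lines = select_by_rsid(vcf_lines, ["rs222"])
print("\n".join(fingerprint_lines))
```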

6. Evaluating an existing fingerprint (optional)

If the genotyping sample already has a corresponding fingerprint VCF file, the workflow can optionally check the existing fingerprint to confirm sample identity. It uses the CheckFingerprintTask task to calculate genotype concordance between the workflow’s genotyping output VCF (final_output_vcf) and the known genotypes specified in a fingerprint_genotypes_vcf_file. The workflow returns a boolean indicating whether the sample genotype failed concordance, as well as a Logarithm of Odds (LOD) score for concordance.

7. Genotype concordance (optional)

If control inputs (VCF, index, and sample_name string) are provided, the workflow can examine genotype concordance. First, the VcfToIntervalList task uses the Picard tool VcfToIntervalList to prepare an interval list from the genotype calling output VCF (produced with the above GtcToVcf task). Next, the SelectVariantsForGenotypeConcordance task runs the GATK tool SelectVariants to exclude filtered sites from the VCF that will be used for GenotypeConcordance. Lastly, the GenotypeConcordance task uses the Picard tool GenotypeConcordance to calculate genotype concordance between the workflow’s output VCF and the control VCF.
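As a rough illustration of the pass/fail decision, the sketch below computes the fraction of shared sites with matching genotypes and compares it with the default genotype_concordance_threshold of 0.95. Picard GenotypeConcordance reports far more detailed metrics; the function names, the genotype representation, and the strict less-than comparison are assumptions.

```python
def concordance_fraction(called, control):
    """Fraction of sites present in both dicts whose unordered genotypes match."""
    shared = called.keys() & control.keys()
    if not shared:
        raise ValueError("no shared sites to compare")
    matches = sum(1 for site in shared if sorted(called[site]) == sorted(control[site]))
    return matches / len(shared)


def genotype_concordance_failed(fraction, threshold=0.95):
    """True when concordance falls below the threshold (default from the inputs table)."""
    return fraction < threshold


# Synthetic genotypes keyed by (chromosome, position).
called = {("1", 100): ("A", "G"), ("1", 200): ("C", "C"), ("2", 50): ("T", "T"), ("2", 90): ("G", "A")}
control = {("1", 100): ("G", "A"), ("1", 200): ("C", "C"), ("2", 50): ("T", "C"), ("2", 90): ("A", "G")}

fraction = concordance_fraction(called, control)
print(fraction, genotype_concordance_failed(fraction))
```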

Workflow outputs

The tables below summarize all of the workflow's output according to the task. Outputs from optional tasks are marked as optional. If the workflow output is written to a file, the file format is listed. Otherwise, the file format is listed as N/A. Note that file outputs are named with the chip_well_barcode prefix.

For more information on the VCF output, see the pipeline's VCF Overview.
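Because file outputs share the chip_well_barcode prefix, the expected file names can be derived from the barcode alone. The sketch below does this for the outputs marked Required in the genotype calling and metric outputs table that follows; the helper names are hypothetical and the check is only an illustration of the naming scheme.

```python
# Suffixes taken from the rows marked "Required" in the genotype calling and metric outputs table.
REQUIRED_SUFFIXES = [
    ".vcf.gz",
    ".vcf.gz.tbi",
    ".gtc",
    ".verifyidintensity_metrics",
    ".arrays_variant_calling_detail_metrics",
    ".arrays_variant_calling_summary_metrics",
    ".arrays_control_code_summary_metrics",
]


def expected_output_files(chip_well_barcode):
    """Build the expected required output file names for one chip_well_barcode."""
    return [chip_well_barcode + suffix for suffix in REQUIRED_SUFFIXES]


def missing_outputs(chip_well_barcode, present_files):
    """Return expected file names that are absent from present_files."""
    present = set(present_files)
    return [name for name in expected_output_files(chip_well_barcode) if name not in present]


print(missing_outputs("7991775143_R01C01", ["7991775143_R01C01.vcf.gz", "7991775143_R01C01.gtc"]))
```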

Genotype calling and metric outputs

| Output name | Description | Required or optional | Output file format |
| --- | --- | --- | --- |
| chip_well_barcode_output | String that matches the chip_well_barcode input variable. | Required | String |
| <chip_well_barcode>.vcf.gz | VCF generated by the pipeline. | Required | Compressed VCF (vcf.gz) |
| <chip_well_barcode>.vcf.gz.tbi | Index file of the VCF generated by the pipeline. | Required | tabix index (vcf.gz.tbi) |
| <chip_well_barcode>.gtc | GTC file generated by Autocall. | Required | GTC |
| <chip_well_barcode>.bafregress_metrics | Text output file generated by BafRegress. | Optional | TXT |
| <chip_well_barcode>.verifyidintensity_metrics | File containing metrics generated by VerifyIDIntensity. | Required | TXT |
| <chip_well_barcode>.arrays_variant_calling_detail_metrics | Detailed metrics file for the output VCF generated by CollectArraysVariantCallingMetrics.detail_metrics. | Required | TXT |
| <chip_well_barcode>.arrays_variant_calling_summary_metrics | Summary metrics file for the output VCF as generated by CollectArraysVariantCallingMetrics. | Required | TXT |
| <chip_well_barcode>.arrays_control_code_summary_metrics | Control code metrics file for the output VCF as generated by CollectArraysVariantCallingMetrics. | Required | TXT |
| <chip_well_barcode_subset>.arrays_variant_calling_detail_metrics | Detailed metrics file for the subsetted VCF as generated by CollectArraysVariantCallingMetrics. | Optional | TXT |
| <chip_well_barcode_subset>.arrays_subset_variant_calling_summary_metrics | Summary metrics file for the subsetted VCF as generated by CollectArraysVariantCallingMetrics. | Optional | TXT |
| <chip_well_barcode_subset>.arrays_subset_variant_calling_control_metrics | Control code metrics file for the subsetted VCF as generated by CollectArraysVariantCallingMetrics. | Optional | TXT |
| analysis_version_number_output | Numeric value used to track the number of pipeline runs on the chip_well_barcode; same value as the analysis_version_number input. | Required | Value |
| <chip_well_barcode>.results.txt | Bafregress results file. | Optional | TXT |

Fingerprinting outputs

| Output name | Description | Required or optional | Output file format |
| --- | --- | --- | --- |
| <chip_well_barcode>.fingerprint.vcf.gz | VCF containing genotypes selected from the output_vcf at certain designated sites. | Optional | Compressed VCF (vcf.gz) |
| <chip_well_barcode>.fingerprint.vcf.gz.tbi | Index file of the output_fingerprint_vcf. | Optional | tabix index (vcf.gz.tbi) |
| check_fingerprint_lod | LOD score as calculated by CheckFingerprint between the pipeline output VCF and the optionally specified input fingerprint VCF. | Optional | N/A |
| <chip_well_barcode>.fingerprinting_summary_metrics | Summary metrics as calculated by CheckFingerprint between the pipeline output VCF and the optionally specified input fingerprint VCF. | Optional | TXT |
| <chip_well_barcode>.fingerprinting_detail_metrics | Detail metrics as calculated by CheckFingerprint between the pipeline output VCF and the optionally specified input fingerprint VCF. | Optional | TXT |

Genotype concordance outputs

| Output name | Description | Required or optional | Output format |
| --- | --- | --- | --- |
| genotype_concordance_failed | Boolean flag to indicate whether the genotype concordance check between the pipeline output VCF and the optionally specified input control VCF failed. | Optional | N/A |
| <chip_well_barcode>.genotype_concordance_summary_metrics | Summary metrics as calculated by GenotypeConcordance between the pipeline output VCF and the optionally specified control VCF. | Optional | TXT |
| <chip_well_barcode>.genotype_concordance_detail_metrics | Detail metrics as calculated by GenotypeConcordance between the pipeline output VCF and the optionally specified control VCF. | Optional | TXT |
| <chip_well_barcode>.genotype_concordance_contingency_metrics | Contingency metrics as calculated by GenotypeConcordance between the pipeline output VCF and the optionally specified control VCF. | Optional | TXT |

Versioning

All Illumina Genotyping Array workflow releases are documented in the workflow changelog.

Try the pipeline in Terra

The Illumina Genotyping Array Pipeline is available on the cloud-based platform Terra. If you have a Terra account, you can access the Featured Workspace using this address: https://app.terra.bio/#workspaces/warp-pipelines/Illumina-Genotyping-Array. The workspace is preloaded with instructions and sample data. For more information on using the Terra platform, please view the Support Center.

Citing the Illumina Genotyping Array Pipeline

If you use the Illumina Genotyping Array Pipeline in your research, please cite our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Feedback and questions

Please help us make our tools better by contacting the WARP team for pipeline-related suggestions or questions.