# Run Relatedness
| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
|---|---|---|---|
| aou_9.1.0 | October 2025 | WARP Pipelines | File an issue |
## Introduction to the Run Relatedness workflow
run_relatedness is a WDL workflow that computes pairwise relatedness across cohort samples and, from the related pairs, identifies a set of samples whose removal reduces relatedness confounding in downstream analyses.

The workflow submits a Hail-based relatedness job to a Google Dataproc cluster using a user-provided submission script and PCA-informed inputs, then runs a second task that applies a maximal-independent-set algorithm to the related sample pairs. It returns both the pairwise relatedness results and a list of flagged samples.
## Quickstart table
| Pipeline Feature | Description | Source |
|---|---|---|
| Analysis type | Cohort relatedness estimation and sample deconfounding | |
| Workflow language | WDL 1.0 | openWDL |
| Data input file format | VCF + PCA scores + Dataproc submission script | |
| Data output file format | TSV relatedness matrix and flagged sample list | |
| Primary software | Hail + Google Dataproc | Hail, Dataproc |
## Set-up

### Run Relatedness installation and requirements
The workflow code can be downloaded by cloning the WARP GitHub repository. For the latest release, please see the run_relatedness changelog.
The pipeline can be deployed using Cromwell, a GA4GH-compliant workflow management system.
## Inputs

### Input descriptions
| Input variable name | Description | Type |
|---|---|---|
| vcf_url | Path to cohort VCF with genotypes used for relatedness. | String |
| pca_scores_url | Path to PCA scores corresponding to samples in `vcf_url`. | String |
| task_identifier | Identifier used as output filename prefix. | String |
| statistics | Statistic to compute. Default: 'kin'. | String |
| min_individual_maf | Minimum individual-specific minor allele frequency. Default: 0.01. | Float |
| block_size | Block matrix size used by the algorithm. Default: 2048. | Int |
| min_kinship | Minimum kinship threshold for reported sample pairs. Default: 0.1. | Float |
| min_partitions | Minimum number of partitions used for optimization. Default: 1200. | Int |
| gcs_output_url | GCS path for relatedness pipeline outputs. | String |
| executor_cores | Spark executor core count. | String |
| driver_cores | Spark driver core count. | String |
| executor_memory | Spark executor memory setting. | String |
| driver_memory | Spark driver memory setting. | String |
| reference_genome | Reference genome identifier (e.g., hg38). | String |
| max_idle | Dataproc cluster max idle time in minutes. Default: 60. | Int |
| max_age | Dataproc cluster max age in minutes. Default: 1440. | Int |
| num_workers | Number of Hail Dataproc workers. | Int |
| gcs_project | Google Cloud project ID used for Dataproc. | String |
| gcs_subnetwork_name | Subnetwork name for Dataproc networking. Default: 'subnetwork'. | String |
| submission_script | Python script submitted to Dataproc to compute relatedness. | File |
| region | Dataproc region. Default: us-central1. | String |
| hail_docker | Docker image for the Dataproc orchestration task. | String |
| hail_docker_maximal_independent_set | Docker image for the maximal independent set task. | String |
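For reference, a Cromwell inputs file covering the required inputs might look like the following sketch. The workflow namespace (`run_relatedness`) and all bucket paths, project IDs, and values are illustrative assumptions, not tested defaults; the two Docker images are taken from the task table below.

```json
{
  "run_relatedness.vcf_url": "gs://my-bucket/cohort.vcf.bgz",
  "run_relatedness.pca_scores_url": "gs://my-bucket/pca_scores.tsv",
  "run_relatedness.task_identifier": "my_cohort",
  "run_relatedness.gcs_output_url": "gs://my-bucket/relatedness/",
  "run_relatedness.reference_genome": "hg38",
  "run_relatedness.num_workers": 10,
  "run_relatedness.gcs_project": "my-gcp-project",
  "run_relatedness.submission_script": "gs://my-bucket/scripts/relatedness.py",
  "run_relatedness.executor_cores": "4",
  "run_relatedness.driver_cores": "8",
  "run_relatedness.executor_memory": "16g",
  "run_relatedness.driver_memory": "32g",
  "run_relatedness.hail_docker": "us.gcr.io/broad-dsde-methods/lichtens/hail_dataproc_wdl:1.1",
  "run_relatedness.hail_docker_maximal_independent_set": "hailgenetics/hail:0.2.67"
}
```

Optional inputs with defaults (for example, `min_kinship` and `max_idle`) can be added with the same `run_relatedness.` prefix to override the values listed above.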
## Run Relatedness tasks and tools
The workflow runs two tasks: one for cluster-based relatedness computation and one for maximal independent set filtering.
To see specific tool parameters, select the task WDL link in the table; then view the `command {}` section of the task in the WDL script.
| Task name and WDL link | Tool | Software | Description |
|---|---|---|---|
| run_relatedness_task | Hail + Dataproc | us.gcr.io/broad-dsde-methods/lichtens/hail_dataproc_wdl:1.1 | Creates Dataproc cluster, submits relatedness computation job, and copies back relatedness TSV. |
| run_maximal_independent_set | Hail graph methods | hailgenetics/hail:0.2.67 | Computes maximal independent set from related pairs and exports flagged samples. |
### 1. Compute pairwise relatedness on Dataproc
`run_relatedness_task` provisions a temporary Dataproc cluster, submits the user-specified script with cohort inputs, and copies `<task_identifier>_relatedness.tsv` back from the cluster's staging storage.
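Conceptually, the relatedness job estimates a kinship statistic per sample pair and reports only pairs at or above `min_kinship` (default 0.1). The actual computation runs in Hail on the cluster; the sketch below illustrates only the thresholding step in pure Python, with made-up sample IDs and kinship values.

```python
MIN_KINSHIP = 0.1  # workflow default for min_kinship

# Hypothetical per-pair kinship estimates of the kind the Hail job
# produces; the IDs and values here are invented for illustration.
pairs = [
    ("s1", "s2", 0.25),   # kinship in the parent-child / full-sibling range
    ("s1", "s3", 0.02),   # effectively unrelated
    ("s2", "s4", 0.125),  # second-degree-relative range
]

# Keep only pairs meeting the reporting threshold; these are the rows
# that would appear in <task_identifier>_relatedness.tsv.
related = [(i, j, kin) for i, j, kin in pairs if kin >= MIN_KINSHIP]
```

Pairs below the threshold are dropped entirely, which keeps the output table small for large cohorts where almost all pairs are unrelated.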
### 2. Compute maximal independent set
`run_maximal_independent_set` reads the relatedness table and applies `hl.maximal_independent_set` to identify a subset of samples for removal, exporting the result to `<task_identifier>.relatedness_flagged_samples.tsv`.
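To see why this works, treat samples as graph nodes and related pairs as edges: the samples *not* flagged must form an independent set, so no two retained samples are related. The pure-Python greedy pass below is a simple stand-in for `hl.maximal_independent_set`, not Hail's exact algorithm (which also considers vertex degree when choosing whom to flag).

```python
def flag_related_samples(related_pairs):
    """Greedily flag samples so that no related pair survives intact.

    An illustrative stand-in for hl.maximal_independent_set: the
    samples NOT returned form an independent set, i.e. no two of them
    appear together in any related pair.
    """
    flagged = set()
    for a, b in related_pairs:
        if a not in flagged and b not in flagged:
            # Pair still intact: arbitrarily keep the first sample
            # and flag the second for removal.
            flagged.add(b)
    return flagged
```

For example, with pairs `(s1, s2)`, `(s2, s3)`, and `(s4, s5)`, flagging `s2` breaks the first two pairs at once, so only `s2` and `s5` need to be removed while `s1`, `s3`, and `s4` are retained.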
## Outputs
| Output variable name | Filename, if applicable | Output format and description |
|---|---|---|
| relatedness | `<task_identifier>_relatedness.tsv` | TSV; pairwise relatedness table containing sample-pair relatedness values. |
| relatedness_flagged_samples | `<task_identifier>.relatedness_flagged_samples.tsv` | TSV; samples flagged by the maximal independent set task for potential exclusion. |
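Downstream, the flagged-samples file can be read with standard TSV tooling and used to subset a cohort. The sketch below uses an in-memory stand-in for the file; the sample IDs and the single `node` column header (Hail's default field name for `maximal_independent_set` output) are assumptions, so check the actual header of your output.

```python
import csv
import io

# Stand-in for <task_identifier>.relatedness_flagged_samples.tsv;
# the `node` header and sample IDs are assumed for illustration.
tsv_text = "node\nHG00096\nHG00171\n"

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
flagged = [row["node"] for row in reader]

# Exclude flagged samples from a downstream cohort sample list.
cohort = ["HG00096", "HG00171", "HG00268"]
kept = [s for s in cohort if s not in set(flagged)]
```

For a real run, replace `io.StringIO(tsv_text)` with `open(path)` on the downloaded output file.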
## Versioning
All run_relatedness releases are documented in the changelog.
## Feedback
Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.