
Run Relatedness

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| --- | --- | --- | --- |
| aou_9.1.0 | October, 2025 | WARP Pipelines | File an issue |

Introduction to the Run Relatedness workflow

run_relatedness is a WDL workflow that computes pairwise relatedness across the samples in a cohort and flags a set of samples whose removal leaves no related pairs among the remaining samples. Removing the flagged samples reduces relatedness confounding in downstream analyses.

The workflow submits a Hail-based relatedness job to a Dataproc cluster using a user-provided submission script, the cohort VCF, and the matching PCA scores. A second task then computes a maximal independent set from the related sample pairs. The workflow returns both the pairwise relatedness results and a list of flagged samples.

Quickstart table

| Pipeline Feature | Description | Source |
| --- | --- | --- |
| Analysis type | Cohort relatedness estimation and sample deconfounding | |
| Workflow language | WDL 1.0 | openWDL |
| Data input file format | VCF + PCA scores + Dataproc submission script | |
| Data output file format | TSV relatedness matrix and flagged sample list | |
| Primary software | Hail + Google Dataproc | Hail, Dataproc |

Set-up

Run Relatedness installation and requirements

The workflow code can be downloaded by cloning the WARP GitHub repository. For the latest release, please see the run_relatedness changelog.

The pipeline can be deployed using Cromwell, a GA4GH-compliant workflow management system.

Inputs

Input descriptions

| Input variable name | Description | Type |
| --- | --- | --- |
| vcf_url | Path to cohort VCF with genotypes used for relatedness. | String |
| pca_scores_url | Path to PCA scores corresponding to samples in vcf_url. | String |
| task_identifier | Identifier used as output filename prefix. | String |
| statistics | Statistic to compute. Default: 'kin'. | String |
| min_individual_maf | Minimum individual-specific minor allele frequency. Default: 0.01. | Float |
| block_size | Block matrix size used by the algorithm. Default: 2048. | Int |
| min_kinship | Minimum kinship threshold for reported sample pairs. Default: 0.1. | Float |
| min_partitions | Minimum number of partitions used for optimization. Default: 1200. | Int |
| gcs_output_url | GCS path for relatedness pipeline outputs. | String |
| executor_cores | Spark executor core count. | String |
| driver_cores | Spark driver core count. | String |
| executor_memory | Spark executor memory setting. | String |
| driver_memory | Spark driver memory setting. | String |
| reference_genome | Reference genome identifier (e.g., hg38). | String |
| max_idle | Dataproc cluster max idle time in minutes. Default: 60. | Int |
| max_age | Dataproc cluster max age in minutes. Default: 1440. | Int |
| num_workers | Number of Hail Dataproc workers. | Int |
| gcs_project | Google Cloud project ID used for Dataproc. | String |
| gcs_subnetwork_name | Subnetwork name for Dataproc networking. Default: 'subnetwork'. | String |
| submission_script | Python script submitted to Dataproc to compute relatedness. | File |
| region | Dataproc region. Default: us-central1. | String |
| hail_docker | Docker image for Dataproc orchestration task. | String |
| hail_docker_maximal_independent_set | Docker image for maximal independent set task. | String |
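
As an illustration of how these inputs might be supplied to Cromwell, the following Python sketch builds an inputs JSON document. It assumes the workflow is named run_relatedness inside the WDL, so that each key takes the form run_relatedness.<input name>; check the WDL for the actual name. Every bucket path, project ID, script path, and resource value is a hypothetical placeholder, the helper build_inputs is not part of WARP, and the two Docker images are simply the ones listed in the task table of this page.

```python
import json

# Assumption: the WDL declares `workflow run_relatedness`, so Cromwell expects
# input keys of the form "run_relatedness.<input name>".
WORKFLOW_NAME = "run_relatedness"


def build_inputs(values, workflow_name=WORKFLOW_NAME):
    """Prefix each input name with the workflow name, as Cromwell expects."""
    return {f"{workflow_name}.{name}": value for name, value in values.items()}


# All values below are hypothetical placeholders, not recommended settings.
example_values = {
    "vcf_url": "gs://example-bucket/cohort.vcf.bgz",
    "pca_scores_url": "gs://example-bucket/cohort_pca_scores.tsv",
    "task_identifier": "example_cohort",
    "statistics": "kin",          # documented default
    "min_individual_maf": 0.01,   # documented default
    "block_size": 2048,           # documented default
    "min_kinship": 0.1,           # documented default
    "min_partitions": 1200,       # documented default
    "gcs_output_url": "gs://example-bucket/relatedness-output",
    "executor_cores": "4",        # typed as String in the input table
    "driver_cores": "4",          # typed as String in the input table
    "executor_memory": "16g",
    "driver_memory": "16g",
    "reference_genome": "hg38",
    "max_idle": 60,               # documented default
    "max_age": 1440,              # documented default
    "num_workers": 2,
    "gcs_project": "example-project",
    "gcs_subnetwork_name": "subnetwork",  # documented default
    "submission_script": "gs://example-bucket/scripts/relatedness_script.py",
    "region": "us-central1",      # documented default
    "hail_docker": "us.gcr.io/broad-dsde-methods/lichtens/hail_dataproc_wdl:1.1",
    "hail_docker_maximal_independent_set": "hailgenetics/hail:0.2.67",
}

inputs_json = json.dumps(build_inputs(example_values), indent=2, sort_keys=True)
print(inputs_json)
```

Inputs that have documented defaults can likely be omitted from the JSON; they are spelled out here only to show the expected types.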

Run Relatedness tasks and tools

The workflow runs two tasks: one for cluster-based relatedness computation and one for maximal independent set filtering.

  1. Compute pairwise relatedness on Dataproc
  2. Compute maximal independent set

To see specific tool parameters, select the task WDL link in the table; then view the command {} section of the task in the WDL script.

| Task name and WDL link | Tool | Software | Description |
| --- | --- | --- | --- |
| run_relatedness_task | Hail + Dataproc | us.gcr.io/broad-dsde-methods/lichtens/hail_dataproc_wdl:1.1 | Creates Dataproc cluster, submits relatedness computation job, and copies back relatedness TSV. |
| run_maximal_independent_set | Hail graph methods | hailgenetics/hail:0.2.67 | Computes maximal independent set from related pairs and exports flagged samples. |

1. Compute pairwise relatedness on Dataproc

run_relatedness_task provisions a temporary Dataproc cluster, submits the user-specified script with cohort inputs, and copies <task_identifier>_relatedness.tsv from cluster staging storage.

2. Compute maximal independent set

run_maximal_independent_set reads the relatedness table and applies hl.maximal_independent_set to identify a sample subset for removal, exporting the result to <task_identifier>.relatedness_flagged_samples.tsv.
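
The idea behind this step can be illustrated without Hail. The sketch below is neither the workflow's code nor Hail's implementation of hl.maximal_independent_set; it is a simple greedy construction in plain Python. The function name flag_related_samples and the sample IDs are hypothetical. Each related pair is treated as an edge in a graph. Samples are visited from fewest to most relatives, and a sample is kept only if none of its relatives has already been kept. The kept samples then form a maximal independent set, and all other samples appearing in the pairs are flagged.

```python
from collections import defaultdict


def flag_related_samples(pairs):
    """Split samples that appear in related pairs into (kept, flagged).

    `pairs` is an iterable of (sample_a, sample_b) tuples, one per related pair.
    This is an illustrative greedy heuristic, not Hail's algorithm.
    """
    # Build an undirected graph: each sample maps to the set of its relatives.
    neighbors = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue  # ignore self-pairs
        neighbors[a].add(b)
        neighbors[b].add(a)

    kept = set()
    # Visit samples with the fewest relatives first; ties are broken by name
    # so the result is deterministic.
    for sample in sorted(neighbors, key=lambda s: (len(neighbors[s]), s)):
        # Keep the sample only if none of its relatives is already kept.
        if neighbors[sample].isdisjoint(kept):
            kept.add(sample)

    # Every sample that was not kept has at least one kept relative.
    flagged = set(neighbors) - kept
    return kept, flagged


# Hypothetical related pairs: A-B, B-C, and D-E.
example_pairs = [("A", "B"), ("B", "C"), ("D", "E")]
kept, flagged = flag_related_samples(example_pairs)
print(sorted(kept))     # ['A', 'C', 'D']
print(sorted(flagged))  # ['B', 'E']
```

In this example, flagging B breaks both the A-B and B-C pairs at once, which is why visiting low-degree samples first tends to keep more samples than an arbitrary order would.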

Outputs

| Output variable name | Filename, if applicable | Output format and description |
| --- | --- | --- |
| relatedness | <task_identifier>_relatedness.tsv | Pairwise relatedness output table containing sample-pair relatedness values. |
| relatedness_flagged_samples | <task_identifier>.relatedness_flagged_samples.tsv | Table of samples flagged by maximal independent set for potential exclusion. |
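
Because both filenames are derived from task_identifier, the expected output names can be computed ahead of time, for example when checking that a run produced everything it should. The helper below is a hypothetical convenience function, not part of the workflow; it only encodes the two filename patterns listed in the table above.

```python
def expected_output_names(task_identifier):
    """Return the output filenames documented for a given task_identifier."""
    return {
        # Note the underscore separator for the pairwise table...
        "relatedness": f"{task_identifier}_relatedness.tsv",
        # ...and the dot separator for the flagged-sample list.
        "relatedness_flagged_samples": (
            f"{task_identifier}.relatedness_flagged_samples.tsv"
        ),
    }


# "example_cohort" is a placeholder identifier.
for output_name, filename in expected_output_names("example_cohort").items():
    print(f"{output_name}: {filename}")
```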

Versioning

All run_relatedness releases are documented in the changelog.

Feedback

Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.