Retrying Workflows
Retrying Terra Workflows via WFL API
WFL staged workloads with a Terra executor have a `/retry` endpoint that selects unretried workflows by their submission ID and re-submits them.

The following `curl` shell command finds the unretried workflows launched by submission `$SUBMISSION` in workload `$UUID` and resubmits the underlying snapshot for processing.
```bash
WFL=https://gotc-prod-wfl.gotc-prod.broadinstitute.org/api/v1/workload
AUTH="Authorization: Bearer $(gcloud auth print-access-token)"
UUID=0d307eb3-2b8e-419c-b687-8c08c84e2a0c       # workload UUID
SUBMISSION=14bffc69-6ce7-4615-b318-7ef1c457c894 # Terra submission UUID

curl -X POST -H "$AUTH" $WFL/$UUID/retry \
     --data "{\"submission\":\"$SUBMISSION\"}" \
     | jq
```
A successful `/retry` request returns the workload specified by `$UUID`. A failed `/retry` request returns a description of the failure.

For legacy (non-staged) workloads, the `/retry` endpoint is unimplemented and returns a `501` HTTP failure status. In such cases, retries may be facilitated by the runbook below.
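Because the outcome is signalled by the HTTP status as well as by the response body, a calling script may want to branch on the status code. The following is a minimal sketch, not part of WFL: `describeRetryStatus` and `retryWithStatus` are hypothetical helper names, the response file path is arbitrary, and the sketch assumes a successful retry is reported with a 2xx status.

```bash
# Hypothetical helper: map an HTTP status code from /retry to a short message.
describeRetryStatus() {
    case "$1" in
        2??) echo "retry accepted" ;;
        400) echo "bad request: invalid filters or no matching workflows" ;;
        501) echo "not implemented: legacy workload, use the runbook" ;;
        *)   echo "unexpected status $1" ;;
    esac
}

# Hypothetical wrapper: POST to /retry, print the response body on stdout,
# then describe the HTTP status on stderr.
# Arguments: WFL workload URL, workload UUID, Terra submission UUID.
retryWithStatus() {
    local body=/tmp/retry-response.json status
    status=$(curl -s -o "$body" -w '%{http_code}' -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        --data "{\"submission\":\"$3\"}" \
        "$1/$2/retry")
    cat "$body"
    describeRetryStatus "$status" >&2
}
```

With the variables from the command above, this could be called as `retryWithStatus "$WFL" "$UUID" "$SUBMISSION" | jq`.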
Request Body
The request body filters must be valid:

- Mandatory `submission` - Terra submission ID (must be a valid UUID)
- Optional `status` - Workflow status (if specified, must be a retriable Cromwell workflow status)
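Escaping the JSON body by hand gets awkward once the optional filter is involved, so one option is to let `jq` build it. This is a sketch under the assumption that the two filters are sent as top-level `submission` and `status` keys, as the list above suggests; `retryBody` is a hypothetical helper name.

```bash
# Hypothetical helper: build a /retry request body.
# Usage: retryBody SUBMISSION [STATUS]
retryBody() {
    jq -c -n --arg submission "$1" --arg status "${2:-}" \
        '{submission: $submission}
         + (if $status == "" then {} else {status: $status} end)'
}

# Example: body with both filters set.
retryBody 14bffc69-6ce7-4615-b318-7ef1c457c894 Failed
# prints {"submission":"14bffc69-6ce7-4615-b318-7ef1c457c894","status":"Failed"}
```

The result can be passed to `curl` with `--data "$(retryBody "$SUBMISSION")"`.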
The only Cromwell statuses supported with the `/retry` API are the terminal workflow statuses:

- `"Aborted"`
- `"Failed"`
- `"Succeeded"`
Why would you retry succeeded workflows?
A workflow may have functionally succeeded, but be scientifically inaccurate and need to be rerun, e.g. if the initial run contained incorrect metadata.
Attempting to retry workflows of any other status will return a `400` HTTP failure status, as will a valid combination of filters with no matching workflows in WFL's DB. Examples:

- A valid Terra submission ID for a different workload
- `"Failed"` workflow status when all unretried workflows had `"Succeeded"`
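Some of these `400` responses can be avoided by checking the filters locally before sending the request. The helpers below are a hypothetical sketch: they only test that the submission ID has the shape of a UUID and that the status is one of the three terminal statuses listed above. They cannot tell whether the submission belongs to the workload or whether any workflows match.

```bash
# Hypothetical helper: succeed if $1 has the shape of a UUID.
isUuid() {
    [[ "$1" =~ ^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$ ]]
}

# Hypothetical helper: succeed if $1 is one of the retriable statuses.
isRetriableStatus() {
    case "$1" in
        Aborted|Failed|Succeeded) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: check the filters before building the request.
if isUuid 14bffc69-6ce7-4615-b318-7ef1c457c894 && isRetriableStatus Failed; then
    echo "filters look valid"
fi
```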
Warnings and Caveats
Submission of snapshot subsets not yet supported
WFL is limited by Rawls functionality and cannot yet submit a subset of a snapshot. So retrying any workflow from a workload snapshot will resubmit all entities from that snapshot.
(Because of this, the optional workflow status filter is purely decorative: all sibling workflows from the same submission will be resubmitted, regardless of their status.)
Example - a running submission from a snapshot has 500 workflows:

- 1 failed
- 249 running
- 250 succeeded
Retrying the failed workflow will create a new submission where all 500 original workflows are retried.
Consider whether you should wait for all workflows in the submission to complete before initiating a retry to avoid multiple workflows running concurrently in competition for the same output files.
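One way to make that decision is to tally workflow statuses before retrying. The sketch below assumes the workflows are available as a JSON array of objects with a `status` field, as the runbook script later on this page assumes for the `/workflows` endpoint; whether staged workloads return the same shape is an assumption. `countByStatus` and `allTerminal` are hypothetical helper names.

```bash
# Hypothetical helper: count workflows by status.
# Reads a JSON array of workflows (objects with a "status" field) on stdin.
countByStatus() {
    jq -c 'map(.status) | group_by(.) | map({(.[0]): length}) | add'
}

# Hypothetical helper: succeed only if every workflow has a terminal status.
allTerminal() {
    jq -e 'all(.[]; .status as $s
                    | ["Aborted", "Failed", "Succeeded"] | index($s) != null)' \
        > /dev/null
}

# Example with inline sample data; in practice the array might come from
# a GET request to $WFL/$UUID/workflows.
sample='[{"status":"Failed"},{"status":"Running"},{"status":"Succeeded"}]'
countByStatus <<< "$sample"
if allTerminal <<< "$sample"; then
    echo "safe to retry"
else
    echo "workflows still running; consider waiting"
fi
```

Note that the tally covers whatever array it is given; narrowing it to a single submission would need a field that this sketch does not assume.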
Race condition when retrying the same workload concurrently
A caller could hit this endpoint for the same workload multiple times in quick succession, making possible a race condition where each run retries the same set of workflows.
Future improvements will make this operation threadsafe, but in the interim try to wait for a response from your retry request before resubmitting.
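Until then, callers sharing one machine can serialize their own requests with a simple lock. This is a client-side sketch only: it does nothing for callers on other machines and is not a substitute for a server-side fix. `withRetryLock` and the lock path are hypothetical.

```bash
# Hypothetical helper: run a command while holding a per-workload lock.
# Usage: withRetryLock UUID COMMAND [ARGS...]
# mkdir is atomic, so only one caller on this machine can create the lock.
withRetryLock() {
    local lock="${TMPDIR:-/tmp}/wfl-retry-$1.lock"
    shift
    if ! mkdir "$lock" 2> /dev/null; then
        echo "a retry for this workload is already in flight" >&2
        return 1
    fi
    "$@"
    local status=$?
    rmdir "$lock"
    return "$status"
}
```

A retry would then be wrapped as `withRetryLock "$UUID" curl -X POST ...`. If the command is interrupted before `rmdir` runs, the lock directory has to be removed by hand.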
Retrying Failures via WFL Runbook
For legacy (non-staged) workloads, WFL remembers enough about submissions to let you quickly resubmit failed workflows with the same inputs/options as they were originally submitted.
All you need is a query string like you'd pass to the `/workflows` endpoint, either:

- `uuid=<UUID>` where `<UUID>` is the identifier of the specific workload you'd like to retry failures from
    - Ex: `uuid=95d536c7-ce3e-4ffc-8c9c-2b9c710d625a`
- `project=<PROJECT>` where `<PROJECT>` is the value of the project field of the workloads you'd like to retry
    - Ex: `project=PO-29619`
With the script below, WFL will find matching workloads and resubmit any unique failures of individual workflows in a new workload (with the same parameters as the originals).

Usage: `bash retry.sh QUERY`

Ex: `bash retry.sh project=PO-29619`
```bash
# Usage: bash retry.sh QUERY [WFL_URL]
#   QUERY is either like `project=PO-123` or `uuid=1a2b3c4d`
#   WFL_URL is the WFL instance to retry workflows from [default: gotc-prod]

WFL_URL="${2:-https://gotc-prod-wfl.gotc-prod.broadinstitute.org}"
AUTH_HEADER="Authorization: Bearer $(gcloud auth print-access-token)"

getWorkloads () {
    # Query -> [Workload]
    curl -s -X GET "${WFL_URL}/api/v1/workload?$1" \
         -H "${AUTH_HEADER}" \
         | jq .
}

getWorkflows() {
    # Workload -> [Workflow]
    uuid=$(jq -r .uuid <<< "$1")
    curl -s -X GET "${WFL_URL}/api/v1/workload/${uuid}/workflows" \
         -H "${AUTH_HEADER}" \
         | jq .
}

failedWorkflowsToSubmit() {
    # [[Workflow]] -> [Workflow]
    # Keep only failed workflows, reduced to their non-null inputs/options.
    jq 'flatten
        | map ( select ( .status=="Failed" )
                | {inputs: .inputs, options: .options}
                | del ( .[] | nulls )
              )
       ' <<< "$1"
}

makeRetryRequest() {
    # [Workload], [Workflow] -> Request
    # Copy the first workload's parameters and attach the workflows as items.
    jq --argjson 'workflows' "$2" \
       '.[0]
        | { executor: .executor
          , input:    .input
          , output:   .output
          , pipeline: .pipeline
          , project:  .project
          , items:    $workflows
          }
        | del(.[] | nulls)
       ' <<< "$1"
}

mapjq () {
    # Apply function $1 to each element of JSON array $2; collect the results.
    # read -r keeps backslashes inside the JSON intact.
    jq -c '.[]' <<< "${2}" \
    | while read -r elem; do ${1} "${elem}"; done \
    | jq -s '[ .[] ]'
}

main() {
    # Query -> ()
    workloads=$(getWorkloads "${1}")
    workflows=$(mapjq getWorkflows "${workloads}")
    toSubmit=$(failedWorkflowsToSubmit "${workflows}")
    makeRetryRequest "${workloads[0]}" "${toSubmit}" > /tmp/retry.json

    curl -X POST "${WFL_URL}/api/v1/exec" \
         -H "${AUTH_HEADER}" \
         -H "Content-Type: application/json" \
         -d @/tmp/retry.json
}

main "$1"
```
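To see what the failure filter in the script keeps, it can be run against made-up data without contacting WFL. The sketch below repeats the `jq` filter from `failedWorkflowsToSubmit` inside a hypothetical `selectFailed` function; the sample workflows are invented for illustration.

```bash
# Hypothetical stand-alone copy of the filter used by failedWorkflowsToSubmit.
# Reads [[Workflow]] on stdin and prints the failed workflows' inputs/options.
selectFailed() {
    jq -c 'flatten
           | map(select(.status == "Failed")
                 | {inputs: .inputs, options: .options}
                 | del(.[] | nulls))'
}

# Made-up sample: one failed workflow without options, one succeeded workflow.
sample='[[{"status":"Failed","inputs":{"a":1},"options":null},
          {"status":"Succeeded","inputs":{"b":2},"options":{"c":3}}]]'
selectFailed <<< "$sample"
# prints [{"inputs":{"a":1}}]
```

The succeeded workflow is dropped, and the null `options` key is removed from the failed one, so only non-null values are resubmitted.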
Tips
Customizing Inputs/Options
If you want to inject a new input or option into all of the retried workflows, you can do that with a `common` block. For example, in the script's `makeRetryRequest` function, replace this:

    jq '{
        executor: .executor,

with this:

    jq '{
        common: { inputs: { "WholeGenomeReprocessing.WholeGenomeGermlineSingleSample.BamToCram.ValidateCram.memory_multiplier": 2 } },
        executor: .executor,
That example uses WFL's arbitrary input feature to bump up the memory multiplier for a particular WGS task.
- Nested input names contain periods, so you'll need to put quotes around them.
- You can't override inputs or options that the workflows originally had (the `common` block has lower precedence).
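As an alternative to editing the `jq` program by hand, the `common` block could be merged into an already-built request. This is a hedged sketch: `withCommonInputs` is a hypothetical helper, and the input name and executor URL are made up for illustration.

```bash
# Hypothetical helper: add a common.inputs block to a request read from stdin.
# Usage: withCommonInputs JSON_OBJECT_OF_INPUTS
withCommonInputs() {
    jq -c --argjson inputs "$1" '. + {common: {inputs: $inputs}}'
}

# The dotted input name is a single JSON key, so it is quoted as one string.
# "Pipeline.Task.memory_multiplier" is a made-up name for illustration.
withCommonInputs '{"Pipeline.Task.memory_multiplier": 2}' \
    <<< '{"executor":"https://cromwell.example.org"}'
# prints {"executor":"https://cromwell.example.org","common":{"inputs":{"Pipeline.Task.memory_multiplier":2}}}
```

Applied to the runbook script, the same helper could post-process `/tmp/retry.json` before the final `curl` call.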