Using WFL Across a Directory

WFL starts each workflow from a single input file. Depending on the pipeline you specify, the other inputs are extrapolated from that file (see WFL's docs for the specific pipeline for more information).

If you have a set of files uploaded to a GCS bucket and you'd like to start a workflow for each one, you can do that via shell scripting.

Suppose we have a set of CRAMs in a folder in some bucket, and we'd like to submit them all to WFL for ExternalExomeReprocessing (perhaps associated with some project or ticket, maybe PO-1234). We'll write a short bash script that will handle this for us.

Tip

Make sure you're able to list the files yourself! You'll need permissions, and you may need to run gcloud auth login first.
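A quick pre-flight check can catch a missing tool before the script gets going. The sketch below is an illustration, not part of WFL: require_tools is a hypothetical helper name, and the tool list simply reflects the commands this guide uses.

```shell
# Hypothetical helper: report which of the named commands are missing from PATH.
require_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "missing required tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# The commands used in this guide.
if require_tools gsutil jq gcloud curl; then
    echo "all tools found"
else
    echo "install the tools listed above, then run: gcloud auth login" >&2
fi
```

This only checks that the commands exist; it does not verify that your account can read the bucket, so listing the files by hand is still worth doing once.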

Step 1: List Files

We need a list of all the files you intend to process. How you get it depends on where the files live; this example uses gs://broad-gotc-dev-wfl-ptc-test-inputs/. We can use wildcards to list out the individual files we'd like. Make a scratch file like script.sh and store the list of CRAMs in a variable:

# In script.sh

CRAMS=$(gsutil ls 'gs://broad-gotc-dev-wfl-ptc-test-inputs/**.cram')
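It can help to stop early if the listing comes back empty, since every later step would otherwise build a request with no items. A minimal sketch, using a hard-coded stand-in for the gsutil output so it runs anywhere (the bucket and file names are made up):

```shell
# Stand-in for: CRAMS=$(gsutil ls 'gs://.../**.cram')
CRAMS=$'gs://example-bucket/a.cram\ngs://example-bucket/sub/b.cram'

if [ -z "$CRAMS" ]; then
    echo "no CRAMs found; check the path and your permissions" >&2
else
    # Count the non-empty lines in the listing.
    COUNT=$(grep -c . <<< "$CRAMS")
    echo "found $COUNT CRAMs"
fi
```

In a real script you would probably exit in the empty branch rather than just printing a message.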

Step 2: Format Items

First, we need to turn that string output into an actual list of file paths. We can use jq to split into lines and select ones that are paths:

FILES=$(jq -sR 'split("\n") | map(select(startswith("gs://")))' <<< "$CRAMS")
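To see what this filter does, here is the same command run on a small stand-in listing (made-up paths, no gsutil needed). The -c flag is only there to print the array on one line.

```shell
# Stand-in for the gsutil output; the blank line mimics stray empty output lines.
CRAMS=$'gs://example-bucket/a.cram\n\ngs://example-bucket/b.cram'

FILES=$(jq -c -sR 'split("\n") | map(select(startswith("gs://")))' <<< "$CRAMS")
echo "$FILES"
# ["gs://example-bucket/a.cram","gs://example-bucket/b.cram"]
```

The select step is what drops the empty strings left behind by blank lines and the trailing newline.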

Next, we need to format each of those file paths into inputs. WFL doesn't just accept a list of files because we allow configuration of many other inputs and options.

ITEMS=$(jq 'map({ inputs: { input_cram: .} })' <<< "$FILES")
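For illustration, this is the shape the filter produces when given a two-element list (the paths are made up, and -c is only used to keep the output compact):

```shell
FILES='["gs://example-bucket/a.cram","gs://example-bucket/b.cram"]'

ITEMS=$(jq -c 'map({ inputs: { input_cram: . } })' <<< "$FILES")
echo "$ITEMS"
# [{"inputs":{"input_cram":"gs://example-bucket/a.cram"}},{"inputs":{"input_cram":"gs://example-bucket/b.cram"}}]
```

Each file path becomes its own item with its own inputs object.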

Info

If you want to process BAMs, you'll need to use input_bam instead of input_cram above.
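If you switch between CRAMs and BAMs often, one option is to pass the input name in as a jq variable instead of editing the filter. This is a sketch: INPUT_KEY is a hypothetical shell variable, not something WFL defines, and the path is made up.

```shell
FILES='["gs://example-bucket/a.bam"]'
INPUT_KEY='input_bam'   # hypothetical; set to input_cram for CRAMs

# ($key) evaluates the variable and uses its value as the object key.
ITEMS=$(jq -c --arg key "$INPUT_KEY" 'map({ inputs: { ($key): . } })' <<< "$FILES")
echo "$ITEMS"
# [{"inputs":{"input_bam":"gs://example-bucket/a.bam"}}]
```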

Step 3: Make Request

Now, we can simply insert those items into a normal ExternalExomeReprocessing workload request:

REQUEST=$(jq '{
    cromwell: "https://cromwell-gotc-auth.gotc-prod.broadinstitute.org",
    output: "gs://broad-gotc-dev-wfl-ptc-test-outputs/xx-test-output",
    pipeline: "ExternalExomeReprocessing",
    project: "PO-1234",
    items: .
}' <<< "$ITEMS")
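Rather than editing the filter each time, you could pass the values that change per run as jq variables. A sketch, assuming hypothetical shell variables OUTPUT and PROJECT; the Cromwell URL and pipeline are copied from the request above, and the item and output location are made-up stand-ins:

```shell
ITEMS='[{"inputs":{"input_cram":"gs://example-bucket/a.cram"}}]'
OUTPUT='gs://example-bucket/outputs'   # hypothetical output location
PROJECT='PO-1234'

# --arg binds each shell value to a jq string variable of the same name.
REQUEST=$(jq --arg output "$OUTPUT" --arg project "$PROJECT" '{
    cromwell: "https://cromwell-gotc-auth.gotc-prod.broadinstitute.org",
    output: $output,
    pipeline: "ExternalExomeReprocessing",
    project: $project,
    items: .
}' <<< "$ITEMS")

echo "$REQUEST"
```

Using --arg also means jq handles the JSON quoting of the values, so nothing needs to be escaped by hand.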

Info

Remember to change the output bucket! The project isn't used by WFL itself, but we keep track of it to help you organize workloads by ticket or anything else.

Info

You can make other customizations here too, like specifying some input or option across all the workflows by adding a common block. See the docs for your pipeline or the workflow options page for more info.

Last, we can use curl to send off the request to WFL:

curl -X POST 'https://dev-wfl.gotc-dev.broadinstitute.org/api/v1/exec' \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H 'Content-Type: application/json' \
    -d "$REQUEST"
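Before sending anything, it can be worth checking that the request is well-formed JSON and actually contains items. A sketch using jq's -e flag, which sets the exit status from the filter's result; the REQUEST here is a trimmed stand-in and STATUS is a hypothetical variable name:

```shell
REQUEST='{"pipeline":"ExternalExomeReprocessing","items":[{"inputs":{"input_cram":"gs://example-bucket/a.cram"}}]}'

# jq -e exits non-zero when the last output is false or null, or on a parse error.
if jq -e '.items | length > 0' <<< "$REQUEST" > /dev/null; then
    STATUS="ready"
    echo "request has $(jq '.items | length' <<< "$REQUEST") item(s); ready to send"
else
    STATUS="empty"
    echo "request has no items; not sending" >&2
fi
```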

Warning

curl will complain if $REQUEST contains more than a thousand lines of data. In that case, dump the payload to a file such as payload.json and let curl read from that file instead. For example, the last step can be replaced by:

echo "$REQUEST" > "payload.json"

curl -X POST 'https://dev-wfl.gotc-dev.broadinstitute.org/api/v1/exec' \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H 'Content-Type: application/json' \
    -d "@payload.json"

With this, the final result is something like the following:

CRAMS=$(gsutil ls 'gs://broad-gotc-dev-wfl-ptc-test-inputs/**.cram')

FILES=$(jq -sR 'split("\n") | map(select(startswith("gs://")))' <<< "$CRAMS")

ITEMS=$(jq 'map({ inputs: { input_cram: .} })' <<< "$FILES")

REQUEST=$(jq '{
    cromwell: "https://cromwell-gotc-auth.gotc-prod.broadinstitute.org",
    output: "gs://broad-gotc-dev-wfl-ptc-test-outputs/xx-test-output",
    pipeline: "ExternalExomeReprocessing",
    project: "PO-1234",
    items: .
}' <<< "$ITEMS")

curl -X POST 'https://gotc-prod-wfl.gotc-prod.broadinstitute.org/api/v1/exec' \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H 'Content-Type: application/json' \
    -d "$REQUEST"

Save that as script.sh, run it with bash script.sh, and you should be good to go!

Other Notes

Have a lot of workflows to submit? You can use array slicing to help split things up:

FILES=$(jq -sR 'split("\n") | map(select(startswith("gs://")))[0:5000]' <<< "$CRAMS")
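The slice above takes only the first 5000 files. To cover the whole list, one approach is to cut it into consecutive batches and build one request per batch. A sketch with a batch size of 2 so the output stays readable (BATCH_SIZE is a hypothetical variable name and the paths are made up):

```shell
FILES='["gs://b/1.cram","gs://b/2.cram","gs://b/3.cram","gs://b/4.cram","gs://b/5.cram"]'
BATCH_SIZE=2

# Emit one JSON array per line, each holding at most BATCH_SIZE paths.
BATCHES=$(jq -c --argjson n "$BATCH_SIZE" \
    'range(0; length; $n) as $i | .[$i:$i + $n]' <<< "$FILES")

while IFS= read -r BATCH; do
    echo "batch: $BATCH"
done <<< "$BATCHES"
```

Inside the loop, each BATCH could be fed through the same ITEMS and REQUEST steps as a full FILES list.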

Need to select files matching some other query too? You can chain the map-select commands and use other string filters on the file names:

FILES=$(jq -sR 'split("\n") | map(select(startswith("gs://"))) |
    map(select(contains("foobar")))' <<< "$CRAMS")
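The two selects can also be folded into one with and, which gives the same result. A small example on made-up input:

```shell
CRAMS=$'gs://example-bucket/foobar-1.cram\ngs://example-bucket/other.cram\nnot-a-path-foobar'

# Keep only lines that are gs:// paths AND mention "foobar".
FILES=$(jq -c -sR 'split("\n") |
    map(select(startswith("gs://") and contains("foobar")))' <<< "$CRAMS")
echo "$FILES"
# ["gs://example-bucket/foobar-1.cram"]
```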

If contains/startswith/endswith aren't enough, you can use test with a PCRE-style regular expression:

FILES=$(jq -sR 'split("\n") | map(select(startswith("gs://"))) |
    map(select(test("fo+bar")))' <<< "$CRAMS")
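As a quick check of what that pattern accepts, here it is on three made-up paths. Note that test matches anywhere in the string unless the pattern is anchored.

```shell
CRAMS=$'gs://example-bucket/fobar.cram\ngs://example-bucket/fooobar.cram\ngs://example-bucket/fbar.cram'

# "fo+bar" needs an f, one or more o's, then "bar"; fbar.cram has no o.
FILES=$(jq -c -sR 'split("\n") | map(select(startswith("gs://"))) |
    map(select(test("fo+bar")))' <<< "$CRAMS")
echo "$FILES"
# ["gs://example-bucket/fobar.cram","gs://example-bucket/fooobar.cram"]
```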

See this page for more jq info.