Bento Workflow Execution Service (WES)

Developing bento_wes requires Python 3.10+ and Poetry >=1.5.1.

Overview

Workflow execution service for the Bento platform. This service implements the GA4GH WES API schema with additional Bento-specific features.

Workflow definition

A workflow is based on a .wdl file, which defines the tasks and their I/O dependencies (i.e. which variables or files are required as input, and what the workflow outputs). See the Workflow Definition Language Specs for more information. A mandatory JSON file containing the required metadata (variable values, file names, etc. to be used by the workflow) is also provided.

Where are workflows defined in Bento?

In Bento, each data-related service (e.g. Katsu, Gohan) stores its own workflows in a /**/workflows/ directory. The workflows can be requested from the workflows API endpoints exposed by these microservices (e.g. list all workflows, show details, or download the .wdl file for a specific workflow). Note that two different files are exported: the regular .wdl file (consumed by bento-wes) and a JSON file which lists the inputs and outputs of the workflow. This JSON file is consumed by bento-web to build the corresponding form for the user and to propagate some settings to the workflow metadata. The files generated by the workflow are retrieved (e.g. for storage in DRS) using the output variables defined in this secondary JSON document (which is not part of the WDL specs). Finally, the metadata, including the reference to the workflow, is generated by the bento-web service (not bento-wes) when the execution is triggered.

Workflow execution

The WES container may receive a /runs POST request to execute a given workflow with specified metadata. The WES service then queries the workflow provider to get the relevant .wdl file, which is copied into a temporary execution folder along with the metadata as JSON.
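As an illustration of the staging step described above, here is a minimal sketch that writes a workflow and its metadata into a fresh per-run directory. The function name `stage_run` and the file names `workflow.wdl` and `params.json` are assumptions made for this example and are not taken from the bento_wes source.

```python
import json
import tempfile
from pathlib import Path


def stage_run(wdl_text: str, metadata: dict, tmp_root: str) -> Path:
    """Write the workflow and its metadata into a fresh per-run directory."""
    # One directory per run keeps concurrent executions isolated.
    run_dir = Path(tempfile.mkdtemp(prefix="run-", dir=tmp_root))
    # File names are hypothetical, chosen only for this sketch.
    (run_dir / "workflow.wdl").write_text(wdl_text)
    (run_dir / "params.json").write_text(json.dumps(metadata, indent=2))
    return run_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        staged = stage_run("version 1.0\nworkflow demo {}\n", {"demo.x": 1}, root)
        print(sorted(p.name for p in staged.iterdir()))
```

In a real deployment, `tmp_root` would presumably be the directory configured through SERVICE_TEMP.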

The Cromwell workflow management system is used to execute the WDL files. In a first step, the dependencies such as input files are copied over locally. Note that in development mode, the temporary files are not cleaned up after completion.

Each run is monitored and its state is stored in a local database.

Note that some metadata may contain callback URLs, which are called once the workflow described in the .wdl file has been executed. This is the case for Katsu ingestion workflows.

File sharing between services

The WES needs to access the files used as input. It may also pass references to files to other services as part of the workflow. For example during an ingestion workflow, a file must be passed to the relevant data service for ingestion in its internal database. This file transfer is based on mounted volumes shared between the containers.

Of note, the wes/tmp directory is mounted in some data service containers (with the exception of Gohan, which mounts the dropbox data directory instead). When a workflow is executed, this is where the necessary input files are staged. This side effect is used to pass files for ingestion to the relevant containers. Some workflows (ingestion workflows in Katsu) contain an "identity" task which only takes a path to a dropbox file as input and returns a local path to a temp file. Note that the /wes/tmp volume must be mounted at the same path in every container for this to work seamlessly.

REST API

/service-info Type

ca.c3g.bento:wes:VERSION

/runs POST

Parameter:

{
  "workflow_params": {/* ... */},    // unused?
  "workflow_type": "WDL",
  "workflow_type_version": "1.0",
  "workflow_engine_parameters": {}, // unused
  "workflow_url": "...",      // where the WES can fetch the .wdl file
  "tags": {
    "workflow_id": "...",     // must correspond to the workflow_params namespace
    "workflow_metadata": {
      "inputs": [{}]   // Defines setup for injecting values into the .wdl input section. IDs must align.
    }
  }
}

Note: this diverges from GA4GH recommendations, under which tags.workflow_metadata should be in workflow_params. The usage of the tags property is Bento-specific, and the callback mechanism should probably be part of the task definitions.
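To make the request shape above concrete, the following sketch assembles the fields for a /runs POST. It assumes, following the GA4GH WES convention, that the request is sent as form data with structured values JSON-encoded as strings; whether bento_wes requires exactly this encoding is not confirmed here. The helper name `build_run_form` and the example URL, workflow ID, and input ID are hypothetical.

```python
import json


def build_run_form(workflow_url, workflow_id, inputs, workflow_params):
    """Return form fields for POST /runs; structured values are JSON-encoded."""
    tags = {
        "workflow_id": workflow_id,
        # Bento-specific: input definitions travel in tags, not workflow_params.
        "workflow_metadata": {"inputs": inputs},
    }
    return {
        "workflow_params": json.dumps(workflow_params),
        "workflow_type": "WDL",
        "workflow_type_version": "1.0",
        "workflow_engine_parameters": json.dumps({}),
        "workflow_url": workflow_url,
        "tags": json.dumps(tags),
    }


if __name__ == "__main__":
    form = build_run_form(
        "http://katsu.example/workflows/ingest.wdl",  # hypothetical URL
        "ingest",                                     # hypothetical workflow ID
        [{"id": "json_document", "type": "file", "required": True}],
        {"ingest.json_document": "/wes/tmp/data.json"},
    )
    print(sorted(form))
```

The resulting dictionary could then be passed as form data to an HTTP client of your choice.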

/runs GET

Lists all runs. Optional parameter: with_details (BOOL).

/runs/<uuid> GET

Details of the run corresponding to the uuid

/runs/<uuid>/stdout GET ; /runs/<uuid>/stderr GET

Stream of the run's stdout or stderr, respectively

/runs/<uuid>/cancel POST

Cancel run

/runs/<uuid>/status GET

Get run state
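A client would typically poll the status endpoint until the run finishes. The sketch below shows one way to do that; the terminal state names come from the GA4GH WES State enum, while the function name `wait_for_run`, the polling interval, and the poll limit are assumptions. The state lookup is passed in as a callable so that the loop does not depend on any particular HTTP client.

```python
import time
from typing import Callable

# Terminal states as named in the GA4GH WES State enum.
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def wait_for_run(get_state: Callable[[], str], interval: float = 5.0,
                 max_polls: int = 120) -> str:
    """Poll get_state() until a terminal state is seen, then return it."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval)
    raise TimeoutError("run did not reach a terminal state")


if __name__ == "__main__":
    # Simulated sequence of states, standing in for GET /runs/<uuid>/status.
    fake_states = iter(["QUEUED", "RUNNING", "COMPLETE"])
    print(wait_for_run(lambda: next(fake_states), interval=0))
```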

Environment Variables

# Bento instance or service base URL, used for generating absolute URLs within
# the service, for making requests, and for re-writing internal URLS in the case
# of Singularity-based Bento instances
BENTO_URL=http://127.0.0.1:5000/

# Debug mode for the service - falls back to FLASK_ENV (development = true,
# any other value = false) if not set
# SECURITY NOTE: This SHOULD NOT EVER be enabled in production, as it removes
# checks for TLS certificate validity!
BENTO_DEBUG=False

# SSL Configuration - whether to validate certificates
BENTO_VALIDATE_SSL=True

# Celery configuration
CELERY_RESULT_BACKEND=redis://
CELERY_BROKER_URL=redis://

# Event Redis connection
BENTO_EVENT_REDIS_URL=redis://localhost:6379

# Run/task database location
DATABASE=data/bento_wes.db

# Service configuration
# - unique ID for the service within the Bento instance
SERVICE_ID=
# - persistent data directory - this is used for file output artifacts from 
#   workflows, which is especially useful for analysis/export workflows.
SERVICE_DATA=data
# - temporary data directory - the service currently does not make this by
#   itself, so this must be created prior to startup
SERVICE_TEMP=tmp
# - base url for service endpoints
SERVICE_BASE_URL=http://localhost:5000/

# Location of WOMtool, used to validate WDL files
# - If not set, no WDL validation will be done
# - SECURITY: If not set, WORKFLOW_HOST_ALLOW_LIST must contain a
#   comma-separated list of hosts workflow files can be downloaded from
WOM_TOOL_LOCATION=/path/to/womtool.jar

# Allow-list (comma-separated) for hosts that workflow files can be downloaded
# from - prevents possibly insecure WDLs from being run
WORKFLOW_HOST_ALLOW_LIST=

# Service URL configuration:
BENTO_AUTHZ_SERVICE_URL=
DRS_URL=https://portal.bentov2.local/api/drs
SERVICE_REGISTRY_URL=

# CORS
CORS_ORIGINS='*'
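As an illustration of how a comma-separated allow-list such as WORKFLOW_HOST_ALLOW_LIST might be applied, here is a minimal sketch. It assumes matching on the bare hostname of the workflow URL; whether the service matches on the hostname alone or on host and port is not specified in this document, and both function names are hypothetical.

```python
from urllib.parse import urlparse


def parse_allow_list(raw: str) -> set[str]:
    """Split a comma-separated host list, ignoring blanks and letter case."""
    return {host.strip().lower() for host in raw.split(",") if host.strip()}


def host_allowed(workflow_url: str, allow_list: set[str]) -> bool:
    """Return True when the URL's hostname appears in the allow-list."""
    # urlparse lower-cases the hostname and yields None when there is none.
    host = urlparse(workflow_url).hostname
    return host is not None and host in allow_list


if __name__ == "__main__":
    allowed = parse_allow_list("katsu.example, gohan.example")  # hypothetical hosts
    print(host_allowed("https://katsu.example/workflows/ingest.wdl", allowed))
    print(host_allowed("https://unknown.example/workflows/ingest.wdl", allowed))
```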

Events

wes_run_updated: TODO

wes_run_completed: TODO

Development

Setting up a Virtual Environment

After cloning the repository, let Poetry manage the virtual environment and install the development dependencies for you:

pip install poetry  # if not done so already
poetry install  # will automatically create a virtual environment

Running Tests

To run all tests and linting, use the following command:

poetry run tox

Releases

Release Checklist

  • All tests pass

  • Package version has been updated (following semver) in bento_lib/package.cfg

  • A release can then be created on the GitHub releases page from the master branch, tagged in the format v#.#.#, named in the format Version #.#.#, and listing any changes made.

Note on Versioning

The bento_wes project uses semantic versioning for releasing. If the API is broken in any way, including minor differences in the way a function behaves given an identical set of parameters (excluding bugfixes for unintentional behaviour), the MAJOR version must be incremented. In this way, we guarantee that projects relying on this API do not accidentally break upon upgrading.

Deploying

The bento_wes service can be deployed with a WSGI server such as Gunicorn or uWSGI, specifying bento_wes.app:application as the WSGI application.

It is then best to put an HTTP server such as NGINX in front of Gunicorn.

Flask applications should NEVER be deployed in production via the Flask development server, i.e. flask run!

To run the Celery worker (required to actually run jobs), the following command (or similar) can be used:

nohup poetry run celery --app bento_wes.app worker --loglevel=INFO &> celery.log &

About the implementation

This service is built around a Flask application. It uses Celery to monitor and run workflows executed by Cromwell.

The workflows are downloaded from local services.

No validity checks are performed on these workflows: workflows coming from configured hosts are assumed to be correct (see the WORKFLOW_HOST_ALLOW_LIST environment variable above).

For now, the WOMtool utility used to check the validity of .wdl files is disabled in Bento (see the corresponding Dockerfile).

runs.py

This script contains the route definitions, implemented as Flask Blueprints.

runner.py

This script contains the implementation of workflow execution.

The expected inputs come from the workflow metadata (Bento-specific), which also defines how bento_web will render the workflow set-up UI.

Another extension to the workflow metadata inputs is used to get values from the WES configuration variables. The special value FROM_CONFIG causes the input to be filled in from the Flask app.config property whose name matches the input's id in uppercase. In the following example, the value for this variable will come from the config property KATSU_URL.

{
  // ...,
  "inputs": [
    {
      "id": "katsu_url",
      "type": "string",
      "required": true,
      "value": "FROM_CONFIG",
      "hidden": true
    }, // ...
  ],
  // ...
}
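The FROM_CONFIG lookup described above can be sketched as follows. This is an illustration of the rule as stated in this document (input id, uppercased, used as the config key) and not the actual code in runner.py; the function name `resolve_config_inputs` is hypothetical, and a plain dictionary stands in for Flask's app.config.

```python
FROM_CONFIG = "FROM_CONFIG"


def resolve_config_inputs(inputs: list[dict], config: dict) -> dict:
    """Map each FROM_CONFIG input id to the config value under its uppercase name."""
    resolved = {}
    for item in inputs:
        if item.get("value") == FROM_CONFIG:
            # e.g. id "katsu_url" is looked up as config["KATSU_URL"]
            resolved[item["id"]] = config[item["id"].upper()]
    return resolved


if __name__ == "__main__":
    inputs = [
        {"id": "katsu_url", "type": "string", "required": True,
         "value": "FROM_CONFIG", "hidden": True},
        {"id": "json_document", "type": "file", "required": True},
    ]
    # Hypothetical config value, standing in for app.config.
    print(resolve_config_inputs(inputs, {"KATSU_URL": "http://katsu.example"}))
```

In this sketch a missing config key raises KeyError, so a misconfigured service fails loudly instead of running a workflow with an empty value.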