Prefect 2

Prefect 2 is a Python framework for developing data processing pipelines. It provides wrappers around Python functions that supply enough information to track the inputs and outputs passed between those functions, and this information is then used to create a directed work graph. By abstracting work in this manner, compute infrastructure can be coordinated to work through the items described by the graph. This abstraction is a fairly powerful mechanism that provides a great deal of flexibility.
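To make the idea concrete, the following is a toy sketch, using only the standard library, of how wrapping functions allows the data passed between them to be recorded as graph edges. It is not how Prefect is implemented; the names tracked, Result and GRAPH are hypothetical and exist only for this illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from functools import wraps
from typing import Any, Callable

# Edges of the work graph, stored as (upstream function, downstream function).
GRAPH = []


@dataclass
class Result:
    """A value tagged with the name of the function that produced it."""

    value: Any
    producer: str


def tracked(func: Callable[..., Any]) -> Callable[..., Result]:
    """Wrap func so that data handed between wrapped functions is recorded."""

    @wraps(func)
    def wrapper(*args: Any) -> Result:
        raw_args = []
        for arg in args:
            if isinstance(arg, Result):
                # The argument was produced by another wrapped function,
                # so record an edge from that producer to this function.
                GRAPH.append((arg.producer, func.__name__))
                raw_args.append(arg.value)
            else:
                raw_args.append(arg)
        return Result(func(*raw_args), func.__name__)

    return wrapper


@tracked
def load(n: int) -> list[int]:
    return list(range(n))


@tracked
def double(values: list[int]) -> list[int]:
    return [2 * v for v in values]


@tracked
def total(values: list[int]) -> int:
    return sum(values)


answer = total(double(load(4)))
print(answer.value)  # 12
print(GRAPH)         # [('load', 'double'), ('double', 'total')]
```

A real orchestration engine would use such a graph to decide which items can run and where, rather than running everything eagerly as this sketch does.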

An additional benefit of prefect is that it integrates a web service that provides:
- a dashboard to visualise the data workflows, their execution state, and logs
- a RESTful API that may be used to interact with the underlying database
- the ability to establish polling agents deployed on compute infrastructure, which can then be instructed to perform work

Where is it used?

The Prefect framework is currently used by process_holography.py to create a holography processing pipeline. This script is, in fact, a collection of wrappers around other scripts available in the aces.holography sub-module.

How to set up?

If no PREFECT environment variables are set, executing a prefect-based pipeline creates a temporary orion server that uses an sqlite backend to coordinate the workflow. This orion server is the orchestration engine that instructs prefect task runners which components of the work graph should be commenced. The orion server is in-memory, listens on localhost for incoming interactions with its REST API, and only persists while the main workflow is running.
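As a rough way to see which of these two modes a pipeline would run in, the sketch below inspects only the PREFECT_API_URL environment variable. The function name describe_orchestration is hypothetical, the example URL is illustrative, and the check is deliberately simplified: it assumes no other Prefect configuration (for example a saved profile) supplies the API address.

```python
import os
from typing import Mapping, Optional


def describe_orchestration(env: Optional[Mapping[str, str]] = None) -> str:
    """Report which server a pipeline would use, judging by PREFECT_API_URL only."""
    env = os.environ if env is None else env
    api_url = env.get("PREFECT_API_URL", "").strip()
    if api_url:
        return f"remote server at {api_url}"
    # Nothing configured: a temporary server is expected to be created.
    return "temporary local server (sqlite backend)"


print(describe_orchestration({}))
print(describe_orchestration({"PREFECT_API_URL": "http://localhost:4200/api"}))
```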

Warning

If pipelines are large or are distributed across a network (e.g. SLURM compute nodes), it is possible that the sqlite engine will not be able to keep up with incoming connections, especially on a Lustre based file system. If errors related to ‘file locks’ are thrown, a postgres database should be used instead.

Included in this repository is the setup_orion.sh script, which is intended to be used to:
* create an instance of postgres packaged within a singularity container
* create an orion server instance listening on port 4200 across all incoming interfaces
* establish appropriate PREFECT environment parameters to allow the orion instance to use postgres as its database backend.

This script requires an environment with:
* internet connectivity, ideally behind a firewall with an allow list
* singularity installed
* a python environment with prefect and asyncpg installed

Once this has been carried out, experience shows that the stability and scalability of prefect based workflows greatly increase.

Note

On the system carrying out the computation (e.g. galaxy), simply running the setup_orion.sh script without any optional arguments will create the expected PREFECT environment variables that enable the prefect framework to seamlessly interface with the orion server. The important environment variable that facilitates this on the compute infrastructure is PREFECT_API_URL.

Note

The setup_orion.sh script does not have appropriate POSTGRES_PASS and POSTGRES_ADDR variables set. They will need to be added for the script to run.
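For orientation, the sketch below shows how a password and address of this kind are typically combined into an asyncpg-flavoured SQLAlchemy connection URL. It is an assumption about what the script does with these values, not a copy of it. The function name postgres_url, the user "postgres", the database name "orion" and port 5432 are all hypothetical defaults, and POSTGRES_ADDR is assumed to hold a bare hostname or IP address. In Orion-era Prefect 2 releases such a URL is normally supplied through the PREFECT_ORION_DATABASE_CONNECTION_URL setting; check the script and the installed prefect version before relying on that name.

```python
from urllib.parse import quote


def postgres_url(password: str, address: str, user: str = "postgres",
                 database: str = "orion", port: int = 5432) -> str:
    """Build a postgresql+asyncpg URL; user, database and port are assumed defaults."""
    # Percent-encode the password so characters such as '@' or '/' cannot
    # be mistaken for URL delimiters.
    safe_password = quote(password, safe="")
    return f"postgresql+asyncpg://{user}:{safe_password}@{address}:{port}/{database}"


# Placeholder credentials, for illustration only.
print(postgres_url("s3cret@pw", "10.0.0.5"))
# postgresql+asyncpg://postgres:s3cret%40pw@10.0.0.5:5432/orion
```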

Creating deployments

A deployment is a way of registering a prefect workflow with the orion server. Here, workflow refers to a python script that contains the @flow decorator, and which may comprise many tasks. Once this flow is registered, polling agents running on compute infrastructure can receive work (e.g. be assigned workflows to carry out).

Provided that a workflow is in an accessible script, a deployment may be created with:

prefect deployment build ./process_holography.py:main --name holo-galaxy -i process --tag holo --apply

In the above example:
* the process_holography.py script has its main function decorated by @flow, which is sufficient to register the workflow against the orion server
* --name provides a name for the workflow on orion that is later used to remotely define workflow invocations
* -i process instructs the workflow to be carried out as a subprocess (with alternatives being as sophisticated as docker and/or kubernetes backed clusters)
* --tag holo is a simple tag used to help logically manage sets of pipelines
* --apply will register the resulting main-deployment.yaml against the orion server.

Although a process execution option is set above, the process_holography pipeline will internally request SLURM-backed resources, which is where the heavy lifting compute is carried out.
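If deployments need to be created from a script rather than typed by hand, the command can be assembled programmatically. The sketch below only builds the argument list, using the options shown in this section plus -q for the work queue; the helper name deployment_build_command is hypothetical, and nothing is executed.

```python
import shlex
from typing import Optional, Sequence


def deployment_build_command(entrypoint: str, name: str, infra: str = "process",
                             tags: Sequence[str] = (), queue: Optional[str] = None,
                             apply: bool = True) -> list:
    """Assemble the argument list for 'prefect deployment build'."""
    cmd = ["prefect", "deployment", "build", entrypoint, "--name", name, "-i", infra]
    for tag in tags:
        cmd += ["--tag", tag]
    if queue is not None:
        # Only pass -q when a non-default work queue is requested.
        cmd += ["-q", queue]
    if apply:
        cmd.append("--apply")
    return cmd


cmd = deployment_build_command("./process_holography.py:main", "holo-galaxy", tags=["holo"])
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) from the intended working directory would then issue the command.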

Note

By default, the above command will associate the deployment with a default queue. This may be changed via a -q argument.

Warning

If --storage-block is not used, a LocalFileSystem block is assumed, and prefect will set its root to the working directory at the time the prefect deployment build command was issued. This has a nasty unintended side effect of copying that path (and the contents of the sub-directories below it) to the /tmp directory of the resource running a prefect agent. Until a workaround has been found, it is suggested to build the deployments in a relatively empty folder. This behaviour is only a problem if you are executing the agent on the same system as the one that built the deployment, or a system with the same path.
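Before building a deployment, it can be useful to check how much data sits below the current working directory, since that is what would end up being copied. The following is a minimal sketch of such a check; directory_footprint is a hypothetical helper, not part of prefect or this repository.

```python
import os
import tempfile
from pathlib import Path


def directory_footprint(root):
    """Return (file_count, total_bytes) for the regular files below root."""
    count = 0
    size = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = Path(dirpath) / filename
            if path.is_file():
                count += 1
                size += path.stat().st_size
    return count, size


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "flow.py").write_bytes(b"x" * 12)
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "data.bin").write_bytes(b"abc")
    demo = directory_footprint(tmp)

print(demo)  # (2, 15)
```

Calling directory_footprint(Path.cwd()) just before issuing prefect deployment build gives a quick indication of whether the folder is "relatively empty".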

Running an agent

A prefect agent will interact with an orion instance over the PREFECT_API_URL to monitor the prefect work queues, and will start any work items that are presented. Starting an agent may be done with:

prefect agent start -q 'default'

which will register an agent against a default queue. Agents may be configured to listen to multiple queues and/or tags.

Note

The agent should be started in the intended execution environment. That is to say, you should activate the appropriate set of virtual environments and establish appropriate environment variables before you issue the prefect agent start command.
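A small pre-flight check can catch the most common environment mistakes before the agent is launched. The sketch below is an illustration only: agent_preflight is a hypothetical helper, and it checks just two things, that a prefect executable is on the PATH and that PREFECT_API_URL is set.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def agent_preflight(env: Optional[Mapping[str, str]] = None,
                    which: Callable[[str], Optional[str]] = shutil.which) -> list:
    """Return a list of problems that would stop an agent reaching the orion server."""
    env = os.environ if env is None else env
    problems = []
    if which("prefect") is None:
        problems.append("prefect executable not found on PATH")
    if not env.get("PREFECT_API_URL"):
        problems.append("PREFECT_API_URL is not set")
    return problems


for problem in agent_preflight():
    print("warning:", problem)
```

An empty list means neither problem was detected; it does not guarantee that the server is reachable.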

Running a deployment

Provided a workflow has been registered against the orion server, and an agent is listening on the appropriate queue, workflows can be remotely invoked. Issuing

prefect deployment run "Holography Main Flow/holo-galaxy" -p sbid=44641 -p workdir="/askapbuffer/payne/tgalvin/holography/prefect2_44641"

will start the registered "Holography Main Flow/holo-galaxy" workflow, with the key=value pairs specified after the -p options passed as arguments into the workflow.
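When many invocations need to be submitted (for example one per SBID), the same command can be assembled in Python. This is a sketch under the assumption that every parameter can be written as a plain key=value string; deployment_run_command is a hypothetical helper and the command is only built, not executed.

```python
import shlex


def deployment_run_command(flow_name: str, deployment_name: str, **params) -> list:
    """Assemble the argument list for 'prefect deployment run'."""
    # Deployments are addressed as "<flow name>/<deployment name>".
    cmd = ["prefect", "deployment", "run", f"{flow_name}/{deployment_name}"]
    for key, value in params.items():
        cmd += ["-p", f"{key}={value}"]
    return cmd


cmd = deployment_run_command(
    "Holography Main Flow",
    "holo-galaxy",
    sbid=44641,
    workdir="/askapbuffer/payne/tgalvin/holography/prefect2_44641",
)
print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting problems with the space-containing flow name when it is handed to subprocess.run.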