These are some notes to get us started on a discussion of complex workflow coordination within BEE.
CWL is the language used to describe a workflow. A runner is the software that executes a workflow on an actual system. At LANL, our runner is BEE. The reference runner is `cwltool`.
When it comes to the file system, `cwltool` is in complete control. A workflow is run on a single node, with a local filesystem. `cwltool` moves input and output files to and from temporary directories that it creates. Therefore, it knows where files are supposed to be and can check that they were successfully created and are available to subsequent workflow steps (setting the step's working directory).
We don't have that luxury in BEE. Our workflows run on distributed systems. The WFM and TM are usually only connected by the network. There is no common file system. This may create problems for us, and we need to work through how we'll track files and ensure that they're where they're supposed to be for subsequent steps.
We're currently looking at implementing two complex workflows: `clamr_wf` and `vasp`.
`clamr_wf` executes two logical steps: run `clamr` to generate timestep image files, then run `ffmpeg` to create a movie from all of them. `clamr` writes all its output files to a directory. `ffmpeg` reads files from that directory, using a known filespec (e.g. `graph%05d.png`) to make a movie. A couple of coordination issues we need to discuss:
How do we (the WFM) ensure that the output directory (with a known name) was successfully created? `cwltool` can just look in its temp directory and check that the file is there. The WFM can't do that. We could make the TM (somehow) send back a list of created files. Or, we could just assume the file exists and fail the subsequent step if not.
`ffmpeg` reads files according to a filespec that includes the full path (e.g. `/home/images/graph%05d.png`). We know both of these values a priori (`/home/images` and `graph%05d.png`). But, since we can't (yet) use JavaScript, we need a third workflow step to concatenate the two. Maybe there's another way to do this.
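One possible shape for that third step (an untested sketch; the file names and the `printf` trick are illustrative): a tiny `CommandLineTool` that joins the two strings using only CWL parameter references, which don't require `InlineJavascriptRequirement`:

```yaml
# concat_path.cwl -- hypothetical step that joins the directory and the
# filespec into a single string for the ffmpeg step. Untested sketch.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [printf, "%s/%s"]   # no trailing newline
inputs:
  image_dir:                     # e.g. /home/images
    type: string
    inputBinding: {position: 1}
  filespec:                      # e.g. graph%05d.png
    type: string
    inputBinding: {position: 2}
stdout: joined.txt
outputs:
  movie_input:
    type: string
    outputBinding:
      glob: joined.txt
      loadContents: true
      outputEval: $(self[0].contents)
```

The `ffmpeg` step would then take `movie_input` as a plain string input. Everything here is a parameter reference, not a full JavaScript expression, so it should stay within what we can support.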
`vasp` (no CWL yet) is also a two-step workflow. `vasp` (an MPI code) runs to generate a batch of output files for its particular input parameters. The second step consists of n analysis jobs, one for each of the n files output by `vasp`. These jobs can all run concurrently using the scattering feature of CWL. Some things to discuss about this workflow:
Our current thinking is that we will create a pseudo task for the analysis tasks. This node is dependent on the completion of `vasp`. At some point this pseudo task will be expanded into n real `Task` nodes to be executed concurrently by the WFM and TM.
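A rough sketch of what that expansion could look like on the WFM side (the task structure and names here are hypothetical, not BEE's actual API):

```python
# Hypothetical sketch: replace one pseudo task with n concrete tasks,
# one per file reported back by the TM after vasp completes.
def expand_pseudo_task(pseudo, filenames):
    """Return one concrete task dict per vasp output file.

    pseudo -- dict with 'name', 'command', and 'depends_on' keys
    filenames -- list of output filenames reported by the TM
    """
    return [
        {
            "name": f"{pseudo['name']}_{i}",            # unique node name
            "command": pseudo["command"] + [fname],     # analyze this file
            "depends_on": pseudo["depends_on"],         # inherit vasp dep
        }
        for i, fname in enumerate(filenames)
    ]
```

The expanded nodes inherit the pseudo task's dependency on `vasp`, so the existing dependency machinery would still drive their scheduling.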
How is the WFM to know the filenames generated by `vasp`? The pseudo task depends on an array of filenames returned by `vasp`. These filenames must come back from the TM (somehow).
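One option is for the TM to glob a known output directory once the step completes and ship the sorted list back to the WFM. A minimal sketch (the directory layout and glob pattern are assumptions, not anything `vasp` guarantees):

```python
# Hypothetical TM-side helper: after the vasp step finishes, collect the
# names of the files it produced so they can be returned to the WFM.
from pathlib import Path

def collect_outputs(output_dir, pattern):
    """Return the sorted file paths under output_dir matching pattern."""
    return sorted(str(p) for p in Path(output_dir).glob(pattern))
```

The sort gives the WFM a deterministic ordering, which matters if the scatter index is later used to name the analysis tasks.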
How do we maintain provenance when we're mutating the graph via node expansion? Can we keep versions of nodes that were expanded?
How will the WFM know when all scattered tasks are complete? This actually should just work the way we do things now. Workflow termination (or a subsequent step) is dependent (in the database) on completion of all analysis tasks.
Something to keep in mind: the `vasp` folks actually want to run parameter studies. In that case, we'd be scattering over a set of input parameters, and running an entire `vasp`/analysis graph for each set of parameters.
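That nesting maps onto CWL's subworkflow support: scatter a two-step `vasp`/analysis subworkflow over an array of parameter sets. A hypothetical sketch (the file names and the string-typed parameter sets are placeholders):

```yaml
# param_study.cwl -- hypothetical outer workflow for a parameter study.
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  parameter_sets: string[]        # one entry per vasp run
steps:
  study:
    run: vasp_analysis.cwl        # the two-step vasp/analysis workflow
    scatter: parameters
    in:
      parameters: parameter_sets
    out: [results]                # assumed to be File[] per run
outputs:
  all_results:
    type:
      type: array
      items:
        type: array
        items: File
    outputSource: study/results
```

Each scatter slot runs the whole inner graph, so the pseudo-task expansion question above would apply once per parameter set.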
This ought to get us started. Feel free to post comments/clarifications/questions before our meeting on Wednesday (9/1). Especially @Boogie3D