These are some notes to get us started on a discussion of complex workflow coordination within BEE.
CWL is the language used to describe a workflow. A runner is the software that executes a workflow on an actual system. At LANL, our runner is BEE. The reference runner is `cwltool`.
When it comes to the file system, `cwltool` is in complete control. A workflow is run on a single node, with a local filesystem. `cwltool` moves input and output files to and from temporary directories that it creates. Therefore, it knows where files are supposed to be and can check that they were successfully created and are available to subsequent workflow steps (setting the step's working directory).
We don't have that luxury in BEE. Our workflows run on distributed systems. The WFM and TM are usually only connected by the network. There is no common file system. This may create problems for us, and we need to work through how we'll track files and ensure that they're where they're supposed to be for subsequent steps.
We're currently looking at implementing two complex workflows: `clamr_wf` and `vasp`.
`clamr_wf` executes two logical steps: run `clamr` to generate timestep image files, then run `ffmpeg` to create a movie from all of them. `clamr` writes all its output files to a directory. `ffmpeg` reads files from that directory, using a known filespec (e.g. `graph%05d.png`) to make a movie. A couple of coordination issues we need to discuss:
How do we (the WFM) ensure that the output directory (with a known name) was successfully created? `cwltool` can just look in its temp directory and check that the file is there. The WFM can't do that. We could make the TM (somehow) send back a list of created files. Or, we could just assume the file exists and fail the subsequent step if not.
`ffmpeg` reads files according to a filespec that includes the full path (e.g. `/home/images/graph%05d.png`). We know both of these values a priori (`/home/images` and `graph%05d.png`). But, since we can't (yet) use JavaScript, we need a third workflow step to concatenate the two. Maybe there's another way to do this.
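One possible shape for that third step (an untested sketch; the file names and the `printf` trick are illustrative): a tiny `CommandLineTool` that joins the two strings using only CWL parameter references, which don't require `InlineJavascriptRequirement`:

```yaml
# concat_path.cwl -- hypothetical step that joins the directory and the
# filespec into a single string for the ffmpeg step. Untested sketch.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [printf, "%s/%s"]   # no trailing newline
inputs:
  image_dir:                     # e.g. /home/images
    type: string
    inputBinding: {position: 1}
  filespec:                      # e.g. graph%05d.png
    type: string
    inputBinding: {position: 2}
stdout: joined.txt
outputs:
  movie_input:
    type: string
    outputBinding:
      glob: joined.txt
      loadContents: true
      outputEval: $(self[0].contents)
```

The `ffmpeg` step would then take `movie_input` as a plain string input. Everything here is a parameter reference, not a full JavaScript expression, so it should stay within what we can support.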
`vasp` (no CWL yet) is also a two-step workflow. `vasp` (an MPI code) runs to generate a batch of output files for its particular input parameters. The second step consists of n analysis jobs, one for each of the n files output by `vasp`. These jobs can all run concurrently using the scattering feature of CWL. Some things to discuss about this workflow:
Our current thinking is that we will create a pseudo task for the analysis tasks. This node is dependent on the completion of `vasp`. At some point this pseudo task will be expanded into n real `Task` nodes to be executed concurrently by the WFM and TM.
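A rough sketch of what that expansion could look like on the WFM side (the task structure and names here are hypothetical, not BEE's actual API):

```python
# Hypothetical sketch: replace one pseudo task with n concrete tasks,
# one per file reported back by the TM after vasp completes.
def expand_pseudo_task(pseudo, filenames):
    """Return one concrete task dict per vasp output file.

    pseudo -- dict with 'name', 'command', and 'depends_on' keys
    filenames -- list of output filenames reported by the TM
    """
    return [
        {
            "name": f"{pseudo['name']}_{i}",            # unique node name
            "command": pseudo["command"] + [fname],     # analyze this file
            "depends_on": pseudo["depends_on"],         # inherit vasp dep
        }
        for i, fname in enumerate(filenames)
    ]
```

The expanded nodes inherit the pseudo task's dependency on `vasp`, so the existing dependency machinery would still drive their scheduling.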
How is the WFM to know the filenames generated by `vasp`? The pseudo task depends on an array of filenames returned by `vasp`. These filenames must come back from the TM (somehow).
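One option is for the TM to glob a known output directory once the step completes and ship the sorted list back to the WFM. A minimal sketch (the directory layout and glob pattern are assumptions, not anything `vasp` guarantees):

```python
# Hypothetical TM-side helper: after the vasp step finishes, collect the
# names of the files it produced so they can be returned to the WFM.
from pathlib import Path

def collect_outputs(output_dir, pattern):
    """Return the sorted file paths under output_dir matching pattern."""
    return sorted(str(p) for p in Path(output_dir).glob(pattern))
```

The sort gives the WFM a deterministic ordering, which matters if the scatter index is later used to name the analysis tasks.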
How do we maintain provenance when we're mutating the graph via node expansion? Can we keep versions of nodes that were expanded?
How will the WFM know when all scattered tasks are complete? This actually should just work the way we do things now. Workflow termination (or a subsequent step) is dependent (in the database) on completion of all analysis tasks.
Something to keep in mind: the `vasp` folks actually want to run parameter studies. In that case, we'd be scattering over a set of input parameters, and running an entire `vasp`/analysis graph for each set of parameters.
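That nesting maps onto CWL's subworkflow support: scatter a two-step `vasp`/analysis subworkflow over an array of parameter sets. A hypothetical sketch (the file names and the string-typed parameter sets are placeholders):

```yaml
# param_study.cwl -- hypothetical outer workflow for a parameter study.
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  parameter_sets: string[]        # one entry per vasp run
steps:
  study:
    run: vasp_analysis.cwl        # the two-step vasp/analysis workflow
    scatter: parameters
    in:
      parameters: parameter_sets
    out: [results]                # assumed to be File[] per run
outputs:
  all_results:
    type:
      type: array
      items:
        type: array
        items: File
    outputSource: study/results
```

Each scatter slot runs the whole inner graph, so the pseudo-task expansion question above would apply once per parameter set.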
This ought to get us started. Feel free to post comments/clarifications/questions before our meeting on Wednesday (9/1). Especially @Boogie3D