Refactor WPS outputs locations for alternate contents #745

Open
fmigneault opened this issue Oct 24, 2024 · 0 comments

fmigneault commented Oct 24, 2024

Description

Problem

Directories for job outputs/results are defined as follows:

{WPS_OUTPUT_URL}/{JOB_UUID}.xml # status location
{WPS_OUTPUT_URL}/{JOB_UUID}.log # execution logs
{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/{output.ext} # results of the job if successful

These locations can optionally be nested under a sub-path provided with the X-WPS-Output-Context request header.

For example, providing X-WPS-Output-Context: project/test-1 will result in outputs located at:

{WPS_OUTPUT_URL}/project/test-1/{JOB_UUID}/{outputID}/{output.ext}
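
As a minimal sketch of the current layout, the locations above could be assembled as follows. The function name and its parameters are hypothetical and are not taken from the code base. The sketch also assumes that the context sub-path applies to the status and log files as well as to the results.

```python
from posixpath import join as url_join
from typing import Optional


def current_job_locations(
    base_url: str,
    job_uuid: str,
    output_id: str,
    filename: str,
    context: Optional[str] = None,
) -> dict:
    """Build the current (pre-refactor) locations of a job (hypothetical helper)."""
    # Assumption: the optional X-WPS-Output-Context sub-path is inserted
    # directly after the base URL and applies to every location.
    root = url_join(base_url, context.strip("/")) if context else base_url
    return {
        "status": url_join(root, f"{job_uuid}.xml"),
        "logs": url_join(root, f"{job_uuid}.log"),
        "result": url_join(root, job_uuid, output_id, filename),
    }


locations = current_job_locations(
    "https://host/wpsoutputs", "1234", "out", "data.nc", context="project/test-1"
)
```

Note that the status file, the log file and the job directory are three sibling entries under the same root.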

With the integration of result transforms (#548), there would also be the following location, where alt is any extension that a transform maps from ext:

{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/{output.alt}

With added provenance (#673), the following is also needed, so that each original or transformed result can provide PROV metadata in various representations:

{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/{output.[ext|alt]}.prov.[json|xml|rdf]

Another edge case is an output of type application/directory, whose contents are located at:

{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/{tree-dir|file}

With Zip (#726) or other container transforms from an application/directory, there is no way to generate "output.zip" or the various {...}.prov.[json|xml|rdf] files without potentially conflicting with whatever contents the output directory holds. Furthermore, subsequent provenance or transform requests could end up nesting the prov/alt files inside an archive (e.g., an output.tar.gz or output.prov.json zipped within an output.zip because they were requested earlier):

{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/{tree-dir|file}
{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/output.zip
{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/output.prov.json
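
The nesting problem can be reproduced with a small standalone sketch. It does not use any project code: the zip_output_dir helper and the directory names are hypothetical, and it only assumes that a transform archives everything found under the output directory.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_output_dir(output_dir: Path) -> Path:
    """Naively archive every file under the output directory (hypothetical helper)."""
    archive = output_dir / "output.zip"
    # Collect members first and skip the archive itself so it is never added to itself.
    members = [
        path for path in sorted(output_dir.rglob("*"))
        if path.is_file() and path != archive
    ]
    with zipfile.ZipFile(archive, "w") as zf:
        for path in members:
            zf.write(path, path.relative_to(output_dir).as_posix())
    return archive


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "job-uuid" / "outputID"
    (out / "tree-dir").mkdir(parents=True)
    (out / "tree-dir" / "file.txt").write_text("data")
    # A provenance document requested earlier sits beside the real contents.
    (out / "output.prov.json").write_text("{}")
    with zipfile.ZipFile(zip_output_dir(out)) as zf:
        names = sorted(zf.namelist())
```

Here names contains output.prov.json in addition to tree-dir/file.txt, since nothing distinguishes the generated metadata from the actual directory contents.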

Possible Solutions

1. Using nested directories

One way to address all of the above would be to refactor the directory structure as follows:

{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/status/{JOB_UUID}.log
{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/status/{JOB_UUID}.xml
{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/outputs/{outputID}/{output.ext}
{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/transforms/{outputID}/{output.alt}
{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/prov/{outputID}/{output.prov.[json|xml|rdf]}
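
A minimal sketch of how this nested layout could be resolved is shown below. The function name, its parameters and the dictionary keys are hypothetical; only the status, outputs, transforms and prov sub-directory names come from the proposal.

```python
from posixpath import join as url_join
from typing import Optional


def nested_job_locations(
    base_url: str,
    job_uuid: str,
    output_id: str,
    context: Optional[str] = None,
) -> dict:
    """Build the proposed nested locations of a job (hypothetical helper)."""
    root = url_join(base_url, context.strip("/")) if context else base_url
    # Everything related to the job now lives under a single {JOB_UUID}/ directory.
    job_root = url_join(root, job_uuid)
    return {
        "logs": url_join(job_root, "status", f"{job_uuid}.log"),
        "status": url_join(job_root, "status", f"{job_uuid}.xml"),
        "outputs": url_join(job_root, "outputs", output_id),
        "transforms": url_join(job_root, "transforms", output_id),
        "prov": url_join(job_root, "prov", output_id),
    }


nested = nested_job_locations("https://host/wpsoutputs", "1234", "out", context="project/test-1")
```

Because originals, transforms and provenance each have their own root, an archive generated under transforms/ cannot collide with files located under outputs/.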

With hardcoded status, outputs, transforms and prov sub-directories, we gain multiple advantages:

  • allows extending contents related to results with any future capability/representation, simply by adding a new sub-directory as needed
  • reduces listing of the WPS-output directories that currently duplicate 3 entries each time: {JOB_UUID}.xml, {JOB_UUID}.log, {JOB_UUID}/, since they will all be under a single {JOB_UUID}/ directory
  • separating outputs from transforms makes it clear which contents are the originals and which are the generated alternates

Considerations

  • This change will cause any existing job to be unable to dynamically generate an alternate transform representation, because the nested /outputs/ location would not be resolved for it.
  • Generating the base /outputs/ location will have to consider the various output storages (e.g., PyWPS local directory store vs AWS S3 store), which have different ways to indicate the prefix location.
  • Job preparation would have to set up the /status/ directory for the XML and logs. This might impact how the PyWPS workers are configured, and how this information is chained across the execution pipeline.
  • Workflows that assume certain nested dir locations with {outputID} would have to dynamically resolve the paths.

2. Adjusting the dir result

Another approach would be to preserve the current directory structure, and only adjust results of type application/directory such that their contents are nested within an additional /dir/ sub-directory:

{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/dir/{tree-dir|file}

This would also allow other metadata to be represented as:

{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/{outputID}/dir/{tree-dir|file}
{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/{outputID}/dir.{alt}
{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/{outputID}/dir.prov.[json|xml|rdf]

Considerations

  • Only application/directory outputs would need to be adjusted to consider the hardcoded /dir/.
  • Workflows would need to be updated, but this can be easily addressed since many parts of the code already have special handling for application/directory.
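
This second option could be sketched as follows. The function name, its parameters and the default extensions are hypothetical; only the hardcoded dir name and the dir.{alt} and dir.prov.[json|xml|rdf] patterns come from the proposal.

```python
from posixpath import join as url_join


def directory_output_locations(output_root: str, alt_ext: str = "zip", prov_fmt: str = "json") -> dict:
    """Build the locations of an application/directory output (hypothetical helper)."""
    return {
        # The directory contents move one level down, under the hardcoded dir/.
        "contents": url_join(output_root, "dir") + "/",
        # Alternates and provenance become siblings of dir/ rather than members of it.
        "alternate": url_join(output_root, f"dir.{alt_ext}"),
        "prov": url_join(output_root, f"dir.prov.{prov_fmt}"),
    }


dir_locations = directory_output_locations("https://host/wpsoutputs/1234/out")
```

Archiving only the dir/ location then leaves dir.zip and dir.prov.json out of the archive, whatever the directory itself contains.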

@fmigneault fmigneault added the triage/feature New requested feature. label Oct 24, 2024
@fmigneault fmigneault self-assigned this Oct 24, 2024