Refactor WPS outputs locations for alternate contents #745

fmigneault · 2024-10-24T19:53:53Z

Description

Problem

Directories for job outputs/results are defined as follows:

Lines 1940 to 1942 in c121f8d

    
    {WPS_OUTPUT_URL}/{JOB_UUID}.xml                         # status location
    {WPS_OUTPUT_URL}/{JOB_UUID}.log                         # execution logs
    {WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/{output.ext}     # results of the job if successful

Outputs can optionally be rebased under a nested path using the X-WPS-Output-Context header:

weaver/docs/source/processes.rst

Lines 1951 to 1955 in c121f8d

    
    For example, providing ``X-WPS-Output-Context: project/test-1`` will result in outputs located at:

    .. code-block::

        {WPS_OUTPUT_URL}/project/test-1/{JOB_UUID}/{outputID}/{output.ext}
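As a minimal sketch of the current layout described above, the result location can be composed as follows. The function name and parameters are hypothetical and used only for illustration; this is not Weaver's actual implementation.

```python
from typing import Optional


def current_result_location(
    wps_output_url: str,
    job_uuid: str,
    output_id: str,
    filename: str,
    context: Optional[str] = None,
) -> str:
    """Compose a result URL following the current layout (hypothetical helper)."""
    root = wps_output_url.rstrip("/")
    if context:
        # Value of the X-WPS-Output-Context header, e.g. "project/test-1".
        root = f"{root}/{context.strip('/')}"
    return f"{root}/{job_uuid}/{output_id}/{filename}"
```

Omitting `context` yields the default location directly under `{WPS_OUTPUT_URL}`.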

With the integration of result transforms (#548), there would also be (where alt is any extension mapped by ext transforms):

{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/{output.alt}

With added provenance (#673), each original or transformed result also needs to provide PROV metadata in various representations:

{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/{output.[ext|alt]}.prov.[json|xml|rdf]

Another edge case is an output of type application/directory, where the contents are:

{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/{tree-dir|file}

With Zip (#726) or other container transforms of an application/directory output, there is no way to generate "output.zip" or the various {...}.prov.[json|xml|rdf] files without potentially conflicting with whatever contents the output directory holds. Furthermore, subsequent provenance or transform requests could end up nesting the prov/alt files inside an archive (e.g., output.tar.gz or output.prov.json zipped within an output.zip because they were requested earlier).

{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/{tree-dir|file}
{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/output.zip
{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/output.prov.json
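The conflict can be illustrated with a small check. This is only a sketch: `find_conflicts` is a hypothetical helper, and it assumes the directory contents are available as a list of relative paths.

```python
from typing import Iterable, List


def find_conflicts(directory_contents: Iterable[str], generated_names: Iterable[str]) -> List[str]:
    """Return generated file names that collide with top-level directory entries."""
    # Only the first path segment matters, since generated files are written
    # next to the top-level entries of the directory output.
    top_level = {path.strip("/").split("/")[0] for path in directory_contents}
    return sorted(top_level & set(generated_names))


conflicts = find_conflicts(
    ["data/a.tif", "output.zip", "readme.txt"],  # files produced by the process itself
    ["output.zip", "output.prov.json"],          # files generated afterwards on request
)
```

Here the process itself already produced an `output.zip`, so a Zip transform written to the same location would overwrite it or be mistaken for it.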

Possible Solutions

1. Using nested directories

One way to address all of the above would be to refactor the directory structure as follows:

{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/status/{JOB_UUID}.log
{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/status/{JOB_UUID}.xml
{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/outputs/{outputID}/{output.ext}
{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/transforms/{outputID}/{output.alt}
{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/prov/{outputID}/{output.prov.[json|xml|rdf]}

With hard-coded status, outputs, transforms, and prov sub-directories, we gain multiple advantages:

allows extending contents related to results with any future capability/representation, simply by adding a new sub-directory as needed
reduces the listing of the WPS output directory, which currently holds 3 entries per job ({JOB_UUID}.xml, {JOB_UUID}.log, {JOB_UUID}/), since they would all be under a single {JOB_UUID}/ directory
separating outputs from transforms makes it clear which one is the original vs the generated alternate contents
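The proposed nested layout could be sketched as follows. The helper name and the `SUBDIRS` constant are assumptions made for illustration only; the sub-directory names are the ones listed in the proposal.

```python
from typing import Dict, Optional

# Hard-coded sub-directories from the proposal; a future capability would add an entry here.
SUBDIRS = ("status", "outputs", "transforms", "prov")


def nested_job_locations(wps_output_url: str, job_uuid: str, context: Optional[str] = None) -> Dict[str, str]:
    """Map each sub-directory name to its location under the job directory (hypothetical helper)."""
    root = wps_output_url.rstrip("/")
    if context:
        # Optional X-WPS-Output-Context value, e.g. "project/test-1".
        root = f"{root}/{context.strip('/')}"
    job_root = f"{root}/{job_uuid}"
    return {name: f"{job_root}/{name}" for name in SUBDIRS}


locations = nested_job_locations("https://example.com/wpsoutputs", "1234")
log_url = f"{locations['status']}/1234.log"
result_url = f"{locations['outputs']}/out/data.nc"
```

Everything related to a job then lives under a single `{JOB_UUID}/` entry, and originals (`outputs`) never share a directory with generated files (`transforms`, `prov`).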

Considerations

This change will cause any existing job to be unable to dynamically generate an alternate transform representation, because the nested /outputs/ would not be resolved.
Generating the base /outputs/ location will have to consider the various output storages (e.g., PyWPS local dir store vs AWS S3 store) that have different ways to indicate the prefix location.
Job preparation would have to set up the /status/ directory for the XML and logs. This might impact how the PyWPS workers are configured, and how this information is chained across the execution pipeline.
Workflows that assume certain nested dir locations with {outputID} would have to dynamically resolve the paths.
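One possible mitigation for existing jobs, which is not part of the proposal and is shown only as an assumption, would be to resolve an output directory by trying the nested layout first and falling back to the current one. The sketch below assumes a local file store and a hypothetical `resolve_output_dir` helper; other storages such as S3 would need their own existence check.

```python
from pathlib import Path
from typing import Optional


def resolve_output_dir(wps_output_dir: Path, job_uuid: str, output_id: str) -> Optional[Path]:
    """Locate an output directory under either layout (hypothetical helper, local store only)."""
    candidates = [
        wps_output_dir / job_uuid / "outputs" / output_id,  # proposed nested layout
        wps_output_dir / job_uuid / output_id,              # current layout of existing jobs
    ]
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

The fallback is ambiguous for a job in the current layout whose `{outputID}` is literally `outputs`, so a real implementation would likely need an explicit marker of which layout a job uses.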

2. Adjusting the dir result

Another approach would be to preserve the current directory structure, and only adjust application/directory results so that their contents are nested within an additional /dir/ sub-directory:

{WPS_OUTPUT_URL}/{JOB_UUID}/{outputID}/dir/{tree-dir|file}

This would also allow other metadata to be represented as:

{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/{outputID}/dir/{tree-dir|file}
{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/{outputID}/dir.{alt}
{WPS_OUTPUT_URL}[/{X-WPS-Output-Context}]/{JOB_UUID}/{outputID}/dir.prov.[json|xml|rdf]

Considerations

Only application/directory outputs would need to be adjusted to consider the hardcoded /dir/.
Workflows would need to be updated, but this can be easily addressed since many parts of the code already have special handling for application/directory.
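A minimal sketch of this second layout for an application/directory output follows. The helper name is hypothetical, and `zip` is used only as an example value of `{alt}`.

```python
from typing import Dict, Optional


def dir_output_locations(
    wps_output_url: str,
    job_uuid: str,
    output_id: str,
    alt: str = "zip",
    context: Optional[str] = None,
) -> Dict[str, str]:
    """Compose locations for a directory output nested under /dir/ (hypothetical helper)."""
    root = wps_output_url.rstrip("/")
    if context:
        # Optional X-WPS-Output-Context value.
        root = f"{root}/{context.strip('/')}"
    base = f"{root}/{job_uuid}/{output_id}"
    locations = {
        "contents": f"{base}/dir/",     # original directory tree, isolated from generated files
        "transform": f"{base}/dir.{alt}",
    }
    for fmt in ("json", "xml", "rdf"):
        locations[f"prov.{fmt}"] = f"{base}/dir.prov.{fmt}"
    return locations


paths = dir_output_locations("https://example.com/wpsoutputs", "1234", "out")
```

Because the generated files are siblings of `dir/` rather than members of it, they cannot collide with the directory contents or end up inside a later archive of that directory.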

References

relates to Job output transform #548
relates to Support CWL Prov with cwltool for OGC API - Processes IPT #673
relates to Support ZIP output #726
relates to Support for STAC metadata #103
(static directory output represented by another /stac/ subdir that nests the original results?)

fmigneault added the triage/feature New requested feature. label Oct 24, 2024

fmigneault self-assigned this Oct 24, 2024
