DM-45988: Support storing init-outputs in central repo #256

Draft: kfindeisen wants to merge 15 commits into main from tickets/DM-45988

Conversation

kfindeisen (Member):

This PR adds a new program for one-time creation of pipeline init-outputs as a Kubernetes job.

@kfindeisen force-pushed the tickets/DM-45988 branch 8 times, most recently from 538262e to 1a8b660 on February 24, 2025 at 21:04
@kfindeisen force-pushed the tickets/DM-45988 branch 3 times, most recently from c11bc55 to aa4247d on February 26, 2025 at 01:42

The reference to daf_butler.cli.cliLog.CliLog works fine in some contexts, but breaks in others.

The output was annotated as an iterable, specced as a sequence with "undefined order" (which defeats the main benefit of using a sequence), and implemented as a list with possible duplicates. I've changed the docs to self-consistently say that it's a collection of arbitrary type, and implemented it as a set internally (though the spec still does not guarantee deduplication).
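
A minimal sketch of the resulting contract (names here are illustrative, not the module's actual API): the annotation promises only a collection, while the set-based implementation happens to deduplicate.

```python
from collections.abc import Collection


def _collect_outputs(pipelines) -> Collection[str]:
    """Return init-output names for the given pipelines.

    Callers are promised only a Collection (no ordering and no uniqueness
    guarantee), but a set is used internally, so duplicates are in fact
    removed.
    """
    outputs: set[str] = set()
    for pipeline in pipelines:
        outputs.update(pipeline.outputs)  # assumed attribute
    return outputs
```
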
In PipelinesConfig, "all_pipelines" is used to mean "any pipeline that might be run for any visit". Since the MiddlewareInterface method merely returns all pipelines that are applicable to a specific visit (whether preprocessing or primary), the name _get_combined_pipeline_files is less confusing.

This method returns _all_ pipelines that could possibly be run under
this configuration, which is useful for visit-independent
preparatory work.
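
A hypothetical sketch of the two lookup modes (class and method names are placeholders, not the actual prompt_processing code):

```python
class PipelinesConfig:
    """Placeholder sketch; the real class is more elaborate."""

    def __init__(self, entries):
        # entries: list of (matcher, pipeline_files) pairs, where matcher
        # is a callable deciding whether an entry applies to a visit.
        self._entries = entries

    def get_pipeline_files(self, visit):
        """The pipelines applicable to one specific visit."""
        for matcher, files in self._entries:
            if matcher(visit):
                return list(files)
        return []

    def get_all_pipeline_files(self):
        """All pipelines that could possibly run under this configuration,
        for visit-independent work such as creating init-outputs.
        """
        return {f for _, files in self._entries for f in files}
```
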
The script takes the same environment variables (with the same
defaults) as the activator, but does not run as a service and does not
use any visit-specific information.
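
For example (variable names and defaults below are placeholders; the real script reuses the activator's exact set):

```python
import os

# Placeholders only: the actual script reads the same variables, with the
# same defaults, as the Prompt Processing activator.
central_repo = os.environ.get("CENTRAL_REPO", "/repo/central")
instrument = os.environ.get("RUBIN_INSTRUMENT", "LSSTComCamSim")
deployment_id = os.environ.get("DEPLOYMENT_ID", "local")
```
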
Running Gunicorn already implicitly adds `/app` to the Python path, but
making it explicit will make the container's behavior more consistent
and reliable.

The action now takes an optional input, dockerDir, which allows building containers from Dockerfiles other than ${PROMPT_PROCESSING_DIR}/Dockerfile.

Unlike the service container, the initializer container runs a single script and then exits, so it is best suited for Jobs or similar objects.

The chain needs to be defined at least once (instead of exactly once
like the outputs themselves), but doing it in Prompt Processing leads
to race conditions. The script already needs to define all the runs
that would go into the chain.
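
A minimal sketch of the idempotent setup, assuming the standard lsst.daf.butler registry API (collection names are placeholders):

```python
from lsst.daf.butler import Butler, CollectionType

butler = Butler("/repo/central", writeable=True)

runs = ["Prompt/output/run-a", "Prompt/output/run-b"]  # placeholder names
chain = "Prompt/output"

for run in runs:
    butler.registry.registerRun(run)  # no-op if the run already exists
# Defining the chain here, in the one-time init job, avoids the race that
# occurs when concurrent Prompt Processing workers each try to define it.
butler.registry.registerCollection(chain, CollectionType.CHAINED)
butler.registry.setCollectionChain(chain, runs)
```
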
Output chains are now handled more reliably by the init-output job.

We consistently use the same pattern to handle mocks initialized in setUp: create a subclass of TestCase that does this on demand.
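
The pattern looks roughly like this (class name is illustrative):

```python
import unittest
import unittest.mock


class MockTestCase(unittest.TestCase):
    """Base class that starts patchers on demand and stops them at teardown."""

    def start_patch(self, target, **kwargs):
        patcher = unittest.mock.patch(target, **kwargs)
        mock = patcher.start()
        self.addCleanup(patcher.stop)
        return mock


class ExampleTest(MockTestCase):
    def setUp(self):
        super().setUp()
        # The mock is stopped automatically after each test.
        self.mock_butler = self.start_patch("lsst.daf.butler.Butler")
```
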
I'd also like to count DatasetRefs by other properties, and the code
for doing so is almost completely identical.
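
A sketch of the shared helper (name is hypothetical): take the property as a callable, so one function covers every grouping.

```python
import collections


def count_refs_by(refs, key):
    """Count DatasetRefs grouped by an arbitrary property of each ref."""
    return collections.Counter(key(ref) for ref in refs)


# Usage sketches; attribute names follow lsst.daf.butler.DatasetRef:
#     count_refs_by(refs, lambda ref: ref.datasetType.name)
#     count_refs_by(refs, lambda ref: ref.run)
```
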
Now that init-outputs are created once in the central repo, we should
use those for all processing to ensure consistency.