DM-45988: Support storing init-outputs in central repo #256

Draft: kfindeisen wants to merge 15 commits into main from tickets/DM-45988

Conversation

kfindeisen (Member):

This PR adds a new program for one-time creation of pipeline init-outputs as a Kubernetes job.

@kfindeisen force-pushed the tickets/DM-45988 branch 8 times, most recently from 538262e to 1a8b660 on February 24, 2025 at 21:04
@kfindeisen force-pushed the tickets/DM-45988 branch 3 times, most recently from c11bc55 to aa4247d on February 26, 2025 at 01:42

The reference to daf_butler.cli.cliLog.CliLog works fine in some contexts, but breaks in others.

The output was annotated as an iterable, specced as a sequence with "undefined order" (which defeats the main benefit of using a sequence), and implemented as a list with possible duplicates. I've changed the docs to self-consistently say that it's a collection of arbitrary type, and implemented it as a set internally (though the spec still does not guarantee deduplication).
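
A minimal sketch of the resulting contract (names here are illustrative, not the module's actual API): the annotation promises only a collection, while the set-based implementation happens to deduplicate.

```python
from collections.abc import Collection


def _collect_outputs(pipelines) -> Collection[str]:
    """Return init-output names for the given pipelines.

    Callers are promised only a Collection (no ordering and no uniqueness
    guarantee), but a set is used internally, so duplicates are in fact
    removed.
    """
    outputs: set[str] = set()
    for pipeline in pipelines:
        outputs.update(pipeline.outputs)  # assumed attribute
    return outputs
```
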
In PipelinesConfig, "all_pipelines" is used to mean "any pipeline that might be run for any visit". Since the MiddlewareInterface method merely returns all pipelines that are applicable to a specific visit (whether preprocessing or primary), the name _get_combined_pipeline_files is less confusing.

This method returns _all_ pipelines that could possibly be run under
this configuration, which is useful for visit-independent
preparatory work.
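
A hypothetical sketch of the two lookup modes (class and method names are placeholders, not the actual prompt_processing code):

```python
class PipelinesConfig:
    """Placeholder sketch; the real class is more elaborate."""

    def __init__(self, entries):
        # entries: list of (matcher, pipeline_files) pairs, where matcher
        # is a callable deciding whether an entry applies to a visit.
        self._entries = entries

    def get_pipeline_files(self, visit):
        """The pipelines applicable to one specific visit."""
        for matcher, files in self._entries:
            if matcher(visit):
                return list(files)
        return []

    def get_all_pipeline_files(self):
        """All pipelines that could possibly run under this configuration,
        for visit-independent work such as creating init-outputs.
        """
        return {f for _, files in self._entries for f in files}
```
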
The script takes the same environment variables (with the same
defaults) as the activator, but does not run as a service and does not
use any visit-specific information.
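
For example (variable names and defaults below are placeholders; the real script reuses the activator's exact set):

```python
import os

# Placeholders only: the actual script reads the same variables, with the
# same defaults, as the Prompt Processing activator.
central_repo = os.environ.get("CENTRAL_REPO", "/repo/central")
instrument = os.environ.get("RUBIN_INSTRUMENT", "LSSTComCamSim")
deployment_id = os.environ.get("DEPLOYMENT_ID", "local")
```
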
Running Gunicorn already implicitly adds `/app` to the Python path, but
making it explicit will make the container's behavior more consistent
and reliable.

The action now takes an optional input, dockerDir, which allows building containers from Dockerfiles other than ${PROMPT_PROCESSING_DIR}/Dockerfile.

Unlike the service container, the initializer container runs a single script and then exits, so it is best suited for Jobs or similar objects.

The chain needs to be defined at least once (instead of exactly once
like the outputs themselves), but doing it in Prompt Processing leads
to race conditions. The script already needs to define all the runs
that would go into the chain.
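
A minimal sketch of the idempotent setup, assuming the standard lsst.daf.butler registry API (collection names are placeholders):

```python
from lsst.daf.butler import Butler, CollectionType

butler = Butler("/repo/central", writeable=True)

runs = ["Prompt/output/run-a", "Prompt/output/run-b"]  # placeholder names
chain = "Prompt/output"

for run in runs:
    butler.registry.registerRun(run)  # no-op if the run already exists
# Defining the chain here, in the one-time init job, avoids the race that
# occurs when concurrent Prompt Processing workers each try to define it.
butler.registry.registerCollection(chain, CollectionType.CHAINED)
butler.registry.setCollectionChain(chain, runs)
```
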
Output chains are now handled more reliably by the init-output job.

We consistently use the same pattern to handle mocks initialized in setUp: create a subclass of TestCase that does this on demand.
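
The pattern looks roughly like this (class name is illustrative):

```python
import unittest
import unittest.mock


class MockTestCase(unittest.TestCase):
    """Base class that starts patchers on demand and stops them at teardown."""

    def start_patch(self, target, **kwargs):
        patcher = unittest.mock.patch(target, **kwargs)
        mock = patcher.start()
        self.addCleanup(patcher.stop)
        return mock


class ExampleTest(MockTestCase):
    def setUp(self):
        super().setUp()
        # The mock is stopped automatically after each test.
        self.mock_butler = self.start_patch("lsst.daf.butler.Butler")
```
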
I'd also like to count DatasetRefs by other properties, and the code
for doing so is almost completely identical.
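
A sketch of the shared helper (name is hypothetical): take the property as a callable, so one function covers every grouping.

```python
import collections


def count_refs_by(refs, key):
    """Count DatasetRefs grouped by an arbitrary property of each ref."""
    return collections.Counter(key(ref) for ref in refs)


# Usage sketches; attribute names follow lsst.daf.butler.DatasetRef:
#     count_refs_by(refs, lambda ref: ref.datasetType.name)
#     count_refs_by(refs, lambda ref: ref.run)
```
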
Now that init-outputs are created once in the central repo, we should
use those for all processing to ensure consistency.