Notifications not firing for Analysis Run fails when Analysis Run is part of an Experiment #4009

meeech · 2024-12-18T02:45:40Z

I am starting this ticket to capture the information as I investigate and try to resolve this issue.
If anyone has any pointers or thoughts, please add them.

When an analysis run fails or errors and that analysis run is part of an inline experiment step, we don't get the on-analysis-run-error or on-analysis-run-failed notification.

Analysis Run Error
✅ Background Analysis Run: event: AnalysisRunError object: rollout/basic-rollout
❌ Inline Step Analysis Run: event: AnalysisRunError object: experiment/basic-rollout-exp-steps-b66774df5-3-0

We get the RolloutAborted notification for both, because the event that fires belongs to the rollout/* object in both cases

Analysis Run Fail
✅ Background Analysis Run: event: AnalysisRunFailed object: rollout/basic-rollout
❌ Inline Step Analysis Run: event: AnalysisRunFailed object: experiment/basic-rollout-exp-steps-bd7bdfcc8-4-0

We get the RolloutAborted notification for both, because the event that fires belongs to the rollout/* object in both cases

So this has me thinking there are a few possible explanations:

Is it that we don't give the notification engine access to the experiment object's events?
Or are the events being fired on the wrong object: we use the experiment's EventRecorder when we should find the parent(?) rollout object and use its EventRecorder?

I'll keep digging. I'm unsure what the ideal would be:
Would we like new triggers such as on-experiment-analysis-run-failed and on-experiment-analysis-run-error, or would things be better served by using the existing triggers? I think when the experiment is a step it would make sense to use the existing triggers and have the rollout object available for the templates, but what about standalone experiments?

Version

1.7.2 (but this has been a problem for as long as I've been using experiment steps, so since at least 1.5/1.6)

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.


meeech · 2024-12-19T03:24:48Z

More rough notes from a conversation with @zachaller:

Rollouts uses the notification engine in 2 ways

As a library (handles on-event triggers)
As a controller (handles when triggers)

Path to explore

Create a new notification controller / deployment - this is similar to what Argo CD does.

If you implement a separate notification controller as its own deployment, you would have to recreate an event listener translator function like the one in the rollouts controller (https://github.com/argoproj/argo-rollouts/blob/master/utils%2Frecord%2Frecord.go#L373)
By default, the notification engine just adds a k8s watch on a kind and then runs the evaluation engine on it. Within rollouts, we instead hook into the k8s event system and fire notifications from code, not from the informer.
Argo CD does not work that way; only Rollouts does, which somewhat makes sense for ease of use.
The upstream notification engine doesn't support multiple kinds in one controller, so this might also require multiple config maps for configuration.
It may be an easier path to make a new controller. (Example of making a new one with the notification engine: https://github.com/argoproj/notifications-engine/blob/master/examples%2Fcertmanager%2Fcontroller%2Fmain.go)
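
To make the "event listener translator" idea concrete, here is a minimal sketch of mapping a k8s event reason such as AnalysisRunError to a trigger name such as on-analysis-run-error. The function name `reasonToTrigger` is hypothetical and the CamelCase-to-kebab-case rule is an assumption inferred from the event and trigger names in this issue; it is not copied from record.go and may differ from what the rollouts controller actually does.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// reasonToTrigger is a hypothetical helper: it turns a CamelCase event
// reason into an "on-" prefixed, kebab-case trigger name.
// Assumption: triggers follow the on-<kebab-case-reason> convention.
func reasonToTrigger(reason string) string {
	var b strings.Builder
	b.WriteString("on-")
	var prev rune
	for i, r := range reason {
		// Insert a dash at each lower/digit -> upper boundary.
		if i > 0 && unicode.IsUpper(r) && (unicode.IsLower(prev) || unicode.IsDigit(prev)) {
			b.WriteByte('-')
		}
		b.WriteRune(unicode.ToLower(r))
		prev = r
	}
	return b.String()
}

func main() {
	for _, reason := range []string{"AnalysisRunError", "AnalysisRunFailed", "RolloutAborted"} {
		fmt.Println(reason, "->", reasonToTrigger(reason))
	}
}
```

A separate controller would need some mapping of this kind because it would only see the events, not the in-process calls that the rollouts controller uses today.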

* chore: ignore all debug_bin*
* feat(experiments): Add a utility to check if an experiment belongs to a Rollout. We can then identify when an experiment is a Step in a Rollout.
* chore: typo
* feat(experiments): Fire k8s Event bound to the Rollout when it owns the Experiment. Addresses #4009. This change will fire Analysis Run events bound to the parent Rollout object when the Experiment is a Step in the Rollout.
* Loop through ownerReferences to find the rollout reference. If we pass the belongs-to-rollout check, we know there is a rollout to find.
* Tighten things up - don't need a bool - fetch the ref or nil

Signed-off-by: mitchell amihod <[email protected]>
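
The "fetch the ref or nil" approach described in the commits above can be sketched roughly as follows. This is an illustration, not the code from the PR: `OwnerReference` here is a local stand-in for the Kubernetes `metav1.OwnerReference` type so the example stays self-contained, `rolloutOwnerRef` is a hypothetical name, and matching only on `Kind == "Rollout"` is an assumption.

```go
package main

import "fmt"

// OwnerReference is a minimal stand-in for metav1.OwnerReference,
// reduced to the fields this sketch needs.
type OwnerReference struct {
	APIVersion string
	Kind       string
	Name       string
}

// rolloutOwnerRef (hypothetical name) loops through the owner references
// and returns the first one whose kind is Rollout, or nil if the
// experiment is standalone.
func rolloutOwnerRef(refs []OwnerReference) *OwnerReference {
	for i := range refs {
		if refs[i].Kind == "Rollout" {
			return &refs[i]
		}
	}
	return nil
}

func main() {
	// An experiment created by a rollout step carries an owner reference
	// to that rollout; a standalone experiment does not.
	stepExperiment := []OwnerReference{
		{APIVersion: "argoproj.io/v1alpha1", Kind: "Rollout", Name: "basic-rollout"},
	}
	var standalone []OwnerReference

	if ref := rolloutOwnerRef(stepExperiment); ref != nil {
		fmt.Println("fire event on rollout/" + ref.Name)
	}
	if rolloutOwnerRef(standalone) == nil {
		fmt.Println("fire event on the experiment itself")
	}
}
```

Returning a pointer rather than a bool lets the caller both test for ownership and pick the object to bind the event to in one step, which also leaves standalone experiments on their current path.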

meeech added the bug Something isn't working label Dec 18, 2024

meeech mentioned this issue Feb 9, 2025

fix(experiments): fire rollout event on experiment step #4124

Merged

6 tasks

meeech changed the title ~~Notifications not firing for Analysis Run fails when Analysis Run is part of an Expriment~~ Notifications not firing for Analysis Run fails when Analysis Run is part of an Experiment Feb 21, 2025
