Fix various problems with rollout dashboard's cache and task processing. #55

Merged
merged 1 commit into from
Oct 23, 2024

DFINITYManu
Collaborator

As per https://dfinity.atlassian.net/browse/DRE-303 and https://dfinity.atlassian.net/browse/DRE-304, in certain cases where the administrator has intervened in the rollout (by clearing tasks for retry, or by marking them as failed or as successful), the cache falls out of sync with the actual Airflow state. These task changes do not update the task dates, which the cache relies on to avoid transferring tens of megabytes every 5 seconds.

This PR adds an incremental log parser that picks out the tasks updated in the last update window. With that, we can deduce whether a rollout needs a full task update or an update of only certain tasks.
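To make the idea concrete, here is a minimal sketch of how parsed log events could be turned into a per-rollout refresh plan. It is not the code in this PR: the `LogEvent` shape, the `plan_refresh` name, and the rule that an event without a task ID forces a full update are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class LogEvent:
    """One parsed log entry (hypothetical shape, not Airflow's schema)."""
    when: datetime
    dag_run_id: str
    task_id: Optional[str]  # None when the event concerns the whole run


def plan_refresh(events, window_start):
    """Map each rollout (DAG run) to "full" or to the set of task IDs to refetch."""
    plan = {}
    for ev in events:
        if ev.when < window_start:
            continue  # already handled in an earlier update window
        if ev.task_id is None:
            # Assumption: a run-level event means we cannot tell which
            # tasks changed, so the whole rollout is refetched.
            plan[ev.dag_run_id] = "full"
        elif plan.get(ev.dag_run_id) != "full":
            plan.setdefault(ev.dag_run_id, set()).add(ev.task_id)
    return plan
```

Rollouts absent from the returned plan had no logged changes in the window and can keep their cached tasks.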

Alas, some tasks do not log updates even when they execute (chiefly the tasks implemented by Airflow logic and null operators), so we must still hit Airflow with an incremental "after this date" query (actually, three queries) to obtain that information. We may be able to optimize this to three queries for N rollouts rather than 3 * N, but that is not in the cards for this PR.
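A rough sketch of the "after this date" fallback, under stated assumptions: the description does not say which three queries are issued, so the filter names below are placeholders, and `fetch` stands in for whatever client call actually talks to Airflow.

```python
from datetime import datetime

# Placeholder names: the PR does not say which three date filters it uses.
DATE_FILTERS = ("filter_a", "filter_b", "filter_c")


def tasks_changed_since(fetch, rollout_id, cutoff):
    """Union the results of one "after this date" query per filter.

    `fetch(rollout_id, filter_name, cutoff)` is a hypothetical callable
    returning an iterable of dicts that each carry a "task_id" key.
    """
    changed = {}
    for name in DATE_FILTERS:
        for task in fetch(rollout_id, name, cutoff):
            changed[task["task_id"]] = task  # later queries overwrite duplicates
    return changed


def query_count(n_rollouts, per_rollout=True):
    """Queries per update cycle: 3 * N today, 3 if batched across rollouts."""
    return len(DATE_FILTERS) * (n_rollouts if per_rollout else 1)
```

With this shape, the tasks found through the logs and the tasks returned by these queries can be merged by task ID before the cache is updated.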

This has been verified manually by replicating the error conditions in a local Airflow instance.

@DFINITYManu DFINITYManu requested a review from a team as a code owner October 23, 2024 17:30
@DFINITYManu DFINITYManu merged commit 770efc6 into main Oct 23, 2024
5 checks passed
@DFINITYManu DFINITYManu deleted the coolpr branch October 23, 2024 20:18
2 participants