Fix various problems with rollout dashboard's cache and task processing. #55
As per https://dfinity.atlassian.net/browse/DRE-303 and https://dfinity.atlassian.net/browse/DRE-304, the cache can fall out of sync with the actual Airflow state when an administrator intervenes in a rollout (by clearing tasks for retry, marking them as failed, or marking them as successful). These interventions do not update the task dates, which the cache relies on to avoid transferring tens of megabytes every 5 seconds.
This PR adds an incremental log parser that picks out the tasks updated during the last update window. With that, we can deduce whether a rollout needs a full task update or an update of only certain tasks.
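For reviewers unfamiliar with the idea, here is a minimal sketch of incremental log parsing. It is not the code in this PR: the log line format, the `IncrementalLogParser` class, and the `tasks_to_refresh` helper are all hypothetical, and the rule "full update when the rollout is not cached yet, otherwise only the touched tasks" is an assumption made for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical log line layout; real Airflow logs look different.
LINE_RE = re.compile(
    r"^(?P<ts>\S+) dag_run=(?P<run>\S+) task_id=(?P<task>\S+) state=(?P<state>\S+)$"
)


@dataclass
class IncrementalLogParser:
    """Remembers how much of a growing log was already parsed."""

    offset: int = 0  # number of characters consumed so far

    def feed(self, log_text: str) -> dict[str, set[str]]:
        """Parse only the unseen part of log_text.

        Returns a mapping of DAG run ID to the task IDs mentioned in the
        newly parsed lines.
        """
        new = log_text[self.offset:]
        # Consume complete lines only; a trailing partial line waits
        # until the next call, when the rest of it has arrived.
        last_newline = new.rfind("\n")
        if last_newline == -1:
            return {}
        chunk = new[: last_newline + 1]
        self.offset += len(chunk)

        touched: dict[str, set[str]] = {}
        for line in chunk.splitlines():
            match = LINE_RE.match(line)
            if match:
                touched.setdefault(match["run"], set()).add(match["task"])
        return touched


def tasks_to_refresh(touched: set[str], rollout_is_cached: bool) -> Optional[set[str]]:
    """Return None for "refresh every task", else the task IDs to refresh.

    Assumed policy: a rollout that is not cached yet needs a full update.
    """
    if not rollout_is_cached:
        return None
    return set(touched)
```

Calling `feed` on each poll with the full log text fetched so far yields only the tasks that appeared since the previous poll, so the cache can refresh those tasks instead of the whole rollout.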
Alas, some tasks do not log updates even when they execute (chiefly the tasks implemented by Airflow logic and null operators), so we must still hit Airflow with incremental "after this date" queries (three of them, actually) to obtain that information. We may be able to optimize this to three queries for N rollouts rather than 3 × N, but that is not in the cards for this PR.
This has been verified manually by replicating the error conditions in a local Airflow instance.