Come up with a better way of running reprocessing of data #109

Closed
hellais opened this issue Jan 17, 2025 · 0 comments · Fixed by #112

hellais commented Jan 17, 2025

When we improve the analysis logic or add support for new tests, we would like to re-run analysis and observation generation over the existing data.

The observation generation workflow (https://github.com/ooni/data/blob/main/oonipipeline/src/oonipipeline/tasks/observations.py#L198) should generally only be re-run in these cases:

  • When we detect a bug in the observation generation logic. In this case we want to reprocess all of the data.
  • When we add support for a new, previously unsupported experiment. In this case we might reprocess only the data since the introduction of the experiment (a window that can be enumerated as in the sketch below). This is the case for echcheck, see Add support for parsing echcheck related metadata in pipeline #108.
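
For the partial-reprocessing case, here is a minimal sketch of enumerating the daily bucket_date values to feed back into observation generation; the helper name and the example rollout date are hypothetical:

```python
from datetime import date, timedelta

def bucket_dates_since(start: date, end: date) -> list[str]:
    """Enumerate the daily bucket_date values to reprocess.

    `start` would be the date the experiment (e.g. echcheck) was first
    deployed; `end` is typically today.
    """
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y-%m-%d") for i in range(days + 1)]

# e.g. reprocess everything since a hypothetical echcheck rollout date:
# bucket_dates_since(date(2024, 10, 1), date.today())
```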

The analysis workflow (https://github.com/ooni/data/blob/main/oonipipeline/src/oonipipeline/tasks/analysis.py#L20) will be re-run more often, since the analysis logic improves over time and new fingerprints might be added.

There is a dependency relationship between observations and analysis, mapped through the DAG (https://github.com/ooni/data/blob/main/dags/pipeline.py): whenever we re-run observation generation, we should also re-run analysis.
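
For illustration, here is a minimal sketch of how that dependency might be expressed in an Airflow DAG; the DAG id, task ids, and callables are hypothetical stand-ins, not the actual definitions in dags/pipeline.py:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="reprocess_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule=None,  # triggered manually for reprocessing runs
) as dag:
    make_observations = PythonOperator(
        task_id="make_observations",
        python_callable=lambda: None,  # placeholder for observation generation
    )
    run_analysis = PythonOperator(
        task_id="run_analysis",
        python_callable=lambda: None,  # placeholder for analysis
    )
    # analysis must always follow a re-run of observation generation
    make_observations >> run_analysis
```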

Moreover, whenever we re-run any of these workflows we might create duplicate rows, which, because ClickHouse is eventually consistent, do not get deduplicated until a merge occurs. We have configured the partition keys of the tables to include the YYYYMM prefix of bucket_date for the observation tables and of measurement_start_time for the analysis tables. This allows us to trigger an OPTIMIZE TABLE ... PARTITION YYYYMM to force deduplication of just the affected partitions, which is more performant than optimizing the whole table. This should be factored into the reprocessing logic.
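
As a sketch, the post-reprocessing deduplication step could look like the following; this assumes the tables use a ReplacingMergeTree-style engine whose merges deduplicate rows and that the partition key value is the YYYYMM string. The table name and helper are hypothetical:

```python
from clickhouse_driver import Client

def optimize_partitions(client: Client, table: str, partitions: list[str]) -> None:
    """Force deduplication of just the partitions touched by a reprocessing run.

    `partitions` holds YYYYMM strings derived from the reprocessed
    bucket_date (observations) or measurement_start_time (analysis) range.
    """
    for part in partitions:
        # OPTIMIZE ... FINAL forces the merge that deduplicates rows in
        # a ReplacingMergeTree-style table (assumption about the engine)
        client.execute(f"OPTIMIZE TABLE {table} PARTITION '{part}' FINAL")

# e.g. after reprocessing January and February 2025:
# optimize_partitions(Client("localhost"), "obs_web", ["202501", "202502"])
```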

@hellais hellais added this to Roadmap Jan 17, 2025
@hellais hellais moved this to Sprint Backlog in Roadmap Jan 17, 2025
@hellais hellais assigned hellais and unassigned hynnot Jan 23, 2025
@hellais hellais moved this from Sprint Backlog to In Progress in Roadmap Jan 23, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in Roadmap Jan 24, 2025