Come up with a better way of running reprocessing of data #109

Closed
hellais opened this issue Jan 17, 2025 · 0 comments · Fixed by #112

hellais commented Jan 17, 2025

When we improve the analysis logic or add support for new tests, we would like to re-run analysis and observation generation over the existing data.

The observation generation workflow (https://github.com/ooni/data/blob/main/oonipipeline/src/oonipipeline/tasks/observations.py#L198) should generally only be re-run in these cases:

  • When we detect a bug in the observation generation logic. In this case we want to reprocess all of the data.
  • When we add support for a new, previously unsupported experiment. In this case we might reprocess only the data since the introduction of the experiment (a window that can be enumerated as in the sketch below). This is the case for echcheck, see Add support for parsing echcheck related metadata in pipeline #108.
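
For the partial-reprocessing case, here is a minimal sketch of enumerating the daily bucket_date values to feed back into observation generation; the helper name and the example rollout date are hypothetical:

```python
from datetime import date, timedelta

def bucket_dates_since(start: date, end: date) -> list[str]:
    """Enumerate the daily bucket_date values to reprocess.

    `start` would be the date the experiment (e.g. echcheck) was first
    deployed; `end` is typically today.
    """
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y-%m-%d") for i in range(days + 1)]

# e.g. reprocess everything since a hypothetical echcheck rollout date:
# bucket_dates_since(date(2024, 10, 1), date.today())
```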

The analysis workflow (https://github.com/ooni/data/blob/main/oonipipeline/src/oonipipeline/tasks/analysis.py#L20) will be re-run more often, since the analysis logic improves over time and new fingerprints might be added.

There is a dependency relationship between observations and analysis, mapped through the DAG (https://github.com/ooni/data/blob/main/dags/pipeline.py): whenever we re-run observation generation, we should also re-run analysis.
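
For illustration, here is a minimal sketch of how that dependency might be expressed in an Airflow DAG; the DAG id, task ids, and callables are hypothetical stand-ins, not the actual definitions in dags/pipeline.py:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="reprocess_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule=None,  # triggered manually for reprocessing runs
) as dag:
    make_observations = PythonOperator(
        task_id="make_observations",
        python_callable=lambda: None,  # placeholder for observation generation
    )
    run_analysis = PythonOperator(
        task_id="run_analysis",
        python_callable=lambda: None,  # placeholder for analysis
    )
    # analysis must always follow a re-run of observation generation
    make_observations >> run_analysis
```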

Moreover, whenever we re-run any of these workflows we might create duplicate rows, which, because ClickHouse is eventually consistent, do not get deduplicated until a merge occurs. We have configured the partition keys of the tables to include the YYYYMM prefix of bucket_date for the observation tables and of measurement_start_time for the analysis tables. This allows us to trigger an OPTIMIZE TABLE ... PARTITION YYYYMM to force deduplication of just the affected partitions, which is more performant than optimizing the whole table. This should be factored into the reprocessing logic.
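
As a sketch, the post-reprocessing deduplication step could look like the following; this assumes the tables use a ReplacingMergeTree-style engine whose merges deduplicate rows and that the partition key value is the YYYYMM string. The table name and helper are hypothetical:

```python
from clickhouse_driver import Client

def optimize_partitions(client: Client, table: str, partitions: list[str]) -> None:
    """Force deduplication of just the partitions touched by a reprocessing run.

    `partitions` holds YYYYMM strings derived from the reprocessed
    bucket_date (observations) or measurement_start_time (analysis) range.
    """
    for part in partitions:
        # OPTIMIZE ... FINAL forces the merge that deduplicates rows in
        # a ReplacingMergeTree-style table (assumption about the engine)
        client.execute(f"OPTIMIZE TABLE {table} PARTITION '{part}' FINAL")

# e.g. after reprocessing January and February 2025:
# optimize_partitions(Client("localhost"), "obs_web", ["202501", "202502"])
```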

@hellais hellais added this to Roadmap Jan 17, 2025
@hellais hellais moved this to Sprint Backlog in Roadmap Jan 17, 2025
@hellais hellais assigned hellais and unassigned hynnot Jan 23, 2025
@hellais hellais moved this from Sprint Backlog to In Progress in Roadmap Jan 23, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in Roadmap Jan 24, 2025