When we improve the analysis logic or add new tests, we would like to re-run analysis and observation generation.
The observation generation workflow (https://github.com/ooni/data/blob/main/oonipipeline/src/oonipipeline/tasks/observations.py#L198) should generally only be re-run in these cases:
The analysis workflow (https://github.com/ooni/data/blob/main/oonipipeline/src/oonipipeline/tasks/analysis.py#L20) will be re-run more often, since the analysis could improve over time and new fingerprints might be added.
There is a dependency relationship between observation and analysis, mapped through the DAG (https://github.com/ooni/data/blob/main/dags/pipeline.py). Whenever we re-run observation generation, we should also re-run analysis.
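As a minimal sketch of what this dependency looks like, assuming an Airflow DAG: the task names, `dag_id`, and callables below are hypothetical placeholders, not the actual definitions in dags/pipeline.py.

```python
# Hypothetical sketch of the observation -> analysis dependency.
# The real DAG lives in dags/pipeline.py and its tasks may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_observations(**context):
    # Placeholder for the observation generation task.
    ...


def run_analysis(**context):
    # Placeholder for the analysis task.
    ...


with DAG(
    dag_id="oonipipeline",  # hypothetical dag_id
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    observations = PythonOperator(
        task_id="make_observations", python_callable=run_observations
    )
    analysis = PythonOperator(
        task_id="make_analysis", python_callable=run_analysis
    )

    # Re-running observations must trigger a downstream re-run of analysis.
    observations >> analysis
```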
Moreover, we need to consider that whenever we re-run any of these workflows we might be creating duplicate rows, which, because ClickHouse is eventually consistent, do not get deduplicated until a merge occurs. We have configured the partition keys for the tables to include the `YYYYMM` prefix of `bucket_date` in the case of the observation tables and of `measurement_start_time` for the analysis tables. This allows us to trigger an `OPTIMIZE TABLE ... PARTITION` on the affected `YYYYMM` partitions to force deduplication and have it run more performantly. This should be factored into the reprocessing logic.
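As a sketch of what this deduplication step could look like, assuming the tables use a ReplacingMergeTree-style engine (where duplicates collapse on merge) partitioned by `toYYYYMM(...)`, and using the clickhouse-driver client: the table name and host below are assumptions for illustration, not taken from the actual schema.

```python
# Hypothetical sketch: force deduplication of the monthly partitions
# touched by a reprocessing run. Table name and host are assumptions.
from clickhouse_driver import Client


def optimize_touched_partitions(client: Client, table: str, partitions: list[str]) -> None:
    """Run OPTIMIZE ... FINAL on each YYYYMM partition we rewrote.

    Deduplication only happens when ClickHouse merges parts, so after a
    re-run we explicitly optimize the affected monthly partitions rather
    than the whole table, which is cheaper.
    """
    for partition in partitions:
        # With PARTITION BY toYYYYMM(bucket_date), the partition ID is the
        # YYYYMM string, e.g. '202401'. Identifiers cannot be parameterized,
        # hence the f-string; only use trusted values here.
        client.execute(f"OPTIMIZE TABLE {table} PARTITION ID '{partition}' FINAL")


if __name__ == "__main__":
    client = Client("localhost")  # assumed ClickHouse host
    # e.g. after reprocessing observations for bucket_date 2024-01-*:
    optimize_touched_partitions(client, "obs_web", ["202401"])
```

Scoping the `OPTIMIZE` to the specific partitions rewritten by a re-run, rather than issuing a table-wide `OPTIMIZE TABLE ... FINAL`, is what makes the `YYYYMM` partitioning scheme pay off during reprocessing.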