feat(data-warehouse): Added a new temporal workflow for compacting deltalake tables #28118

Open
wants to merge 3 commits into master
Conversation

Gilbert09
Member

Problem

  • Context: thread on Slack
  • tl;dr: Delta Lake compaction and vacuuming blows up our pod memory; offloading it to a separate pod, outside the critical path of the external-data-job workflow, would be ideal

Changes

  • Spin up a new workflow (a minimal sketch follows this list)
  • Trigger this workflow from the external-data-job workflow
  • TODO:
    • Add some more unit tests for the new workflow
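
For orientation, here is a minimal sketch of the shape of the new workflow, pieced together from the snippets reviewed below; module layout and helper details in the actual PR may differ:

import dataclasses
import datetime as dt

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@dataclasses.dataclass
class DeltalakeCompactionJobWorkflowInputs:
    team_id: int
    external_data_job_id: str


@activity.defn
def run_compaction(inputs: DeltalakeCompactionJobWorkflowInputs) -> None:
    # Loads the ExternalDataJob and runs DeltaTableHelper.compact_table()
    # (see the review snippets further down).
    ...


@workflow.defn(name="deltalake-compaction-job")
class DeltalakeCompactionJobWorkflow:
    @workflow.run
    async def run(self, inputs: DeltalakeCompactionJobWorkflowInputs) -> None:
        # Compaction happens in a single activity: 5-minute timeout, no retries,
        # running on the dedicated compaction task queue.
        await workflow.execute_activity(
            run_compaction,
            inputs,
            start_to_close_timeout=dt.timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=1),
        )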

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

End-to-end tests include triggering the new workflow.

@Gilbert09 requested a review from a team · January 30, 2025 21:05
@greptile-apps bot (Contributor) left a comment


PR Summary

This PR introduces a new temporal workflow for compacting Delta Lake tables to address memory issues in pod execution. The changes move memory-intensive operations out of the critical path and into a separate workflow.

  • Added new DeltalakeCompactionJobWorkflow in deltalake_compaction_job.py with 5-minute timeout and no retries
  • Added DATA_WAREHOUSE_COMPACTION_TASK_QUEUE constant for dedicated compaction queue
  • Modified DeltaTableHelper to handle compaction through new compact_table() method with 24h retention (see the sketch after this list)
  • Added trigger_compaction_job utility with basic error handling for workflow coordination
  • Removed subprocess-based approach in delta_table_subprocess.py in favor of workflow solution
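
The helper's new method is not shown in the excerpts below, but based on the summary above it boils down to an optimize-then-vacuum pass. A minimal sketch, assuming the deltalake Python package's DeltaTable API and an illustrative _get_delta_table() helper:

from deltalake import DeltaTable


def compact_table(self) -> None:
    # Hypothetical helper to resolve the Delta table for this job/schema.
    delta_table: DeltaTable = self._get_delta_table()

    # Rewrite many small files into fewer, larger ones.
    delta_table.optimize.compact()

    # Drop unreferenced files, keeping 24 hours of history (below the 7-day
    # default, hence enforce_retention_duration=False).
    delta_table.vacuum(retention_hours=24, enforce_retention_duration=False, dry_run=False)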

8 file(s) reviewed, 13 comment(s)

@activity.defn
def run_compaction(inputs: DeltalakeCompactionJobWorkflowInputs):
    logger = bind_temporal_worker_logger_sync(team_id=inputs.team_id)
    job = ExternalDataJob.objects.get(id=inputs.external_data_job_id, team_id=inputs.team_id)

logic: No error handling for when job doesn't exist. Should wrap in try/except to handle ObjectDoesNotExist gracefully.

Suggested change
-    job = ExternalDataJob.objects.get(id=inputs.external_data_job_id, team_id=inputs.team_id)
+    try:
+        job = ExternalDataJob.objects.get(id=inputs.external_data_job_id, team_id=inputs.team_id)
+    except ExternalDataJob.DoesNotExist:
+        logger.error("External data job not found", external_data_job_id=inputs.external_data_job_id)
+        raise

        await workflow.execute_activity(
            run_compaction,
            inputs,
            start_to_close_timeout=dt.timedelta(minutes=5),

style: 5 minute timeout may be too short for large tables. Consider making this configurable or increasing default.
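
One option, sketched here, is to carry the timeout in the workflow inputs instead of hard-coding it; the extra field is illustrative, not part of the PR:

import dataclasses
import datetime as dt


@dataclasses.dataclass
class DeltalakeCompactionJobWorkflowInputs:
    team_id: int
    external_data_job_id: str
    # Hypothetical field so callers can size the timeout to the table.
    compaction_timeout_minutes: int = 5


# Inside the workflow's run method:
#     await workflow.execute_activity(
#         run_compaction,
#         inputs,
#         start_to_close_timeout=dt.timedelta(minutes=inputs.compaction_timeout_minutes),
#     )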

Comment on lines +45 to +47
            retry_policy=RetryPolicy(
                maximum_attempts=1,
            ),

style: Single attempt with no retries could lead to transient failures. Consider allowing retries with backoff for recoverable errors.
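
For reference, a less restrictive policy with backoff in the Temporal Python SDK could look roughly like this; the numbers are illustrative:

import datetime as dt

from temporalio.common import RetryPolicy

retry_policy = RetryPolicy(
    initial_interval=dt.timedelta(seconds=10),  # delay before the first retry
    backoff_coefficient=2.0,                    # double the delay on each attempt
    maximum_interval=dt.timedelta(minutes=1),   # cap the delay between attempts
    maximum_attempts=3,                         # allow a couple of retries for transient failures
)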


    delta_table_helper = DeltaTableHelper(resource_name=schema.name, job=job, logger=logger)

    delta_table_helper.compact_table()

logic: compact_table() errors are not caught or logged. Should handle potential DeltaLake errors.
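
A sketch of catching and logging those errors inside the activity; DeltaError is assumed to be the base exception exposed by the deltalake package, and re-raising keeps the failure visible to Temporal:

from deltalake.exceptions import DeltaError

try:
    delta_table_helper.compact_table()
except DeltaError:
    # Log with enough context to find the job, then re-raise so the activity fails loudly.
    logger.exception("Delta Lake compaction failed", external_data_job_id=inputs.external_data_job_id)
    raise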

    logger = bind_temporal_worker_logger_sync(team_id=inputs.team_id)
    job = ExternalDataJob.objects.get(id=inputs.external_data_job_id, team_id=inputs.team_id)

    assert job.schema is not None

style: Using assert for runtime checks is not recommended in production code. Replace with proper validation.

Suggested change
-    assert job.schema is not None
+    if job.schema is None:
+        logger.error("Job schema is None", job_id=job.id)
+        raise ValueError(f"Job {job.id} has no associated schema")

Comment on lines +164 to +165
        compaction_job_id = trigger_compaction_job(self._job, self._schema)
        self._logger.debug(f"Compaction workflow id: {compaction_job_id}")

logic: No error handling around trigger_compaction_job. Should catch and log potential failures to ensure main workflow completion.
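
A sketch of how the caller could shield itself; since compaction is deliberately outside the critical path, a failure to start it arguably should not fail the parent workflow:

try:
    compaction_job_id = trigger_compaction_job(self._job, self._schema)
    self._logger.debug(f"Compaction workflow id: {compaction_job_id}")
except Exception:
    # Best-effort: compaction can be retried on a later run, so log and carry on.
    self._logger.exception("Failed to trigger compaction workflow")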

Comment on lines +291 to +292
    except WorkflowAlreadyStartedError:
        pass

logic: silently ignoring WorkflowAlreadyStartedError could hide important issues - should at least log this case
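
A minimal change along those lines, shown as a fragment mirroring the excerpt above; it assumes a logger is available in this function, which the excerpt does not show:

    except WorkflowAlreadyStartedError:
        # A compaction run for this schema is already in flight; record it rather
        # than swallowing the error silently.
        logger.info("Compaction workflow already running", workflow_id=workflow_id)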

Comment on lines +285 to +288
            retry_policy=RetryPolicy(
                maximum_attempts=1,
                non_retryable_error_types=["NondeterminismError"],
            ),

style: retry policy is very restrictive - consider adding backoff strategy and increasing maximum attempts for transient failures


def trigger_compaction_job(job: ExternalDataJob, schema: ExternalDataSchema) -> str:
    temporal = sync_connect()
    workflow_id = f"{schema.id}-compaction"

style: workflow_id should include more uniqueness guarantees, like job.id or timestamp to prevent conflicts

Suggested change
-    workflow_id = f"{schema.id}-compaction"
+    workflow_id = f"{schema.id}-{job.id}-compaction"

Comment on lines +277 to +290
    asyncio.run(
        temporal.start_workflow(
            workflow="deltalake-compaction-job",
            arg=dataclasses.asdict(
                DeltalakeCompactionJobWorkflowInputs(team_id=job.team_id, external_data_job_id=job.id)
            ),
            id=workflow_id,
            task_queue=str(DATA_WAREHOUSE_COMPACTION_TASK_QUEUE),
            retry_policy=RetryPolicy(
                maximum_attempts=1,
                non_retryable_error_types=["NondeterminismError"],
            ),
        )
    )

logic: asyncio.run() should have a timeout to prevent hanging
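
asyncio.run() has no timeout argument itself, but the coroutine can be wrapped in asyncio.wait_for; the 30-second figure below is an arbitrary illustrative choice:

import asyncio

asyncio.run(
    asyncio.wait_for(
        temporal.start_workflow(
            workflow="deltalake-compaction-job",
            arg=dataclasses.asdict(
                DeltalakeCompactionJobWorkflowInputs(team_id=job.team_id, external_data_job_id=job.id)
            ),
            id=workflow_id,
            task_queue=str(DATA_WAREHOUSE_COMPACTION_TASK_QUEUE),
            retry_policy=RetryPolicy(
                maximum_attempts=1,
                non_retryable_error_types=["NondeterminismError"],
            ),
        ),
        timeout=30,  # give up if Temporal does not accept the start request within 30s
    )
)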

@EDsCODE (Member) left a comment


lgtm! consider the comments from slack thread https://posthog.slack.com/archives/C019RAX2XBN/p1738276797709289
