Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tenants on standby #4218

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from
Draft

Tenants on standby #4218

wants to merge 11 commits into from

Conversation

pablonyx
Copy link
Contributor

@pablonyx pablonyx commented Mar 6, 2025

Description

Maintain a list of ready tenants on standby to provision new tenants / users into. Mitigates the 9+ second latency on first-time registration.

How Has This Been Tested?

[Describe the tests you ran to verify your changes]

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@pablonyx pablonyx requested a review from a team as a code owner March 6, 2025 18:28
Copy link

vercel bot commented Mar 6, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
internal-search ✅ Ready (Inspect) Visit Preview 💬 Add feedback Mar 7, 2025 1:29am

@pablonyx pablonyx marked this pull request as draft March 6, 2025 18:28
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

This PR introduces a tenant pre-provisioning system that maintains a pool of ready-to-use tenants, improving user registration performance by having pre-configured tenants available for immediate assignment.

  • Added NewAvailableTenant model and migration in /backend/alembic_tenants/versions/3b45e0018bf1_add_new_available_tenant_table.py to track pre-provisioned tenants
  • Implemented periodic Celery tasks in /backend/onyx/background/celery/tasks/periodic/tenant_provisioning.py to maintain target number of available tenants (default 5)
  • Modified /backend/ee/onyx/server/tenants/provisioning.py to prioritize using pre-provisioned tenants before creating new ones
  • Added Redis locks (CHECK_AVAILABLE_TENANTS_LOCK, PRE_PROVISION_TENANT_LOCK) in /backend/onyx/configs/constants.py to prevent task overlap
  • Added get_current_alembic_version() in /backend/ee/onyx/server/tenants/schema_management.py to track tenant schema versions

9 file(s) reviewed, 7 comment(s)
Edit PR Review Bot Settings | Greptile

if not MULTI_TENANT:
return None

try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really looks like there are one or more race conditions here. We need to be able to query and assign a tenant atomically, or retry an operation until everything is in a verified state.

add_users_to_tenant([email], tenant_id)

# Create milestone record
with get_session_with_tenant(tenant_id=tenant_id) as db_session:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writing to the db is error prone ... what happens if we crash and don't notify the control plane?


# Trigger pre-provisioning tasks for each tenant needed
for _ in range(tenants_to_provision):
pre_provision_tenant.apply_async(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we've moved to send task nearly everywhere due to circular dependency issues with directly applying tasks. Would prefer to stay away from apply async unless there's a good reason for it.

)

except Exception as e:
task_logger.exception(f"Error in check_available_tenants task: {e}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{e} is redundant with logger.exception

so it's ready to be assigned to a user immediately.
"""
task_logger.info("STARTING PRE_PROVISION_TENANT")
if not MULTI_TENANT:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to stop gating functions with globals. usually we can gate outside them by simply not calling them.

)
return

try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it definitely looks problematic to be running this series of operations not in a single transaction or lock. Have we thought about this yet?

task_logger.info(f"Successfully pre-provisioned tenant {tenant_id}")

except Exception as e:
task_logger.exception(f"Error in pre_provision_tenant task: {e}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{e} redundant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants