-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tenants on standby #4218
base: main
Are you sure you want to change the base?
Tenants on standby #4218
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Summary
This PR introduces a tenant pre-provisioning system that maintains a pool of ready-to-use tenants, improving user registration performance by having pre-configured tenants available for immediate assignment.
- Added
NewAvailableTenant
model and migration in/backend/alembic_tenants/versions/3b45e0018bf1_add_new_available_tenant_table.py
to track pre-provisioned tenants - Implemented periodic Celery tasks in
/backend/onyx/background/celery/tasks/periodic/tenant_provisioning.py
to maintain target number of available tenants (default 5) - Modified
/backend/ee/onyx/server/tenants/provisioning.py
to prioritize using pre-provisioned tenants before creating new ones - Added Redis locks (
CHECK_AVAILABLE_TENANTS_LOCK
,PRE_PROVISION_TENANT_LOCK
) in/backend/onyx/configs/constants.py
to prevent task overlap - Added
get_current_alembic_version()
in/backend/ee/onyx/server/tenants/schema_management.py
to track tenant schema versions
9 file(s) reviewed, 7 comment(s)
Edit PR Review Bot Settings | Greptile
backend/alembic_tenants/versions/3b45e0018bf1_add_new_available_tenant_table.py
Show resolved
Hide resolved
backend/onyx/background/celery/tasks/periodic/tenant_provisioning.py
Outdated
Show resolved
Hide resolved
0c649c0
to
e361e69
Compare
e361e69
to
44cf923
Compare
if not MULTI_TENANT: | ||
return None | ||
|
||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really looks like there are one or more race conditions here. We need to be able to query and assign a tenant atomically, or retry an operation until everything is in a verified state.
add_users_to_tenant([email], tenant_id) | ||
|
||
# Create milestone record | ||
with get_session_with_tenant(tenant_id=tenant_id) as db_session: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
writing to the db is error prone ... what happens if we crash and don't notify the control plane?
|
||
# Trigger pre-provisioning tasks for each tenant needed | ||
for _ in range(tenants_to_provision): | ||
pre_provision_tenant.apply_async( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we've moved to send task nearly everywhere due to circular dependency issues with directly applying tasks. Would prefer to stay away from apply async unless there's a good reason for it.
) | ||
|
||
except Exception as e: | ||
task_logger.exception(f"Error in check_available_tenants task: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{e} is redundant with logger.exception
so it's ready to be assigned to a user immediately. | ||
""" | ||
task_logger.info("STARTING PRE_PROVISION_TENANT") | ||
if not MULTI_TENANT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to stop gating functions with globals. usually we can gate outside them by simply not calling them.
) | ||
return | ||
|
||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it definitely looks problematic to be running this series of operations not in a single transaction or lock. Have we thought about this yet?
task_logger.info(f"Successfully pre-provisioned tenant {tenant_id}") | ||
|
||
except Exception as e: | ||
task_logger.exception(f"Error in pre_provision_tenant task: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{e} redundant.
Description
Maintain a list of ready tenants on standby to provision new tenants / users into. Mitigates the 9+ second latency on first-time registration.
How Has This Been Tested?
[Describe the tests you ran to verify your changes]
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.