Need readiness check endpoint for PD #5658
Comments
Proposing to add a RESTful endpoint, similar to the health endpoint, that exposes the PD instance's etcd readiness information.
The logic for the readiness check would be to verify whether the current PD's corresponding etcd member is still a learner. cc @nolouch
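A rough sketch of that check, assuming the handler can reach the embedded etcd through an etcd clientv3 client; the function name, client wiring, and endpoint argument below are illustrative, not PD's actual code:

```go
package readiness

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// isReady reports whether the local etcd member has been promoted from
// learner to voting member; selfEndpoint is assumed to be the local
// member's client URL.
func isReady(ctx context.Context, cli *clientv3.Client, selfEndpoint string) (bool, error) {
	ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()

	// Status is answered by the member at selfEndpoint, so IsLearner
	// reflects that member's own state (available since etcd v3.4).
	st, err := cli.Status(ctx, selfEndpoint)
	if err != nil {
		return false, fmt.Errorf("etcd status check failed: %w", err)
	}
	return !st.IsLearner, nil
}
```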
IMO, when we do a rolling update, there could also be a log lag between followers and the leader, which cannot be solved by only checking the learner state.
Sure, thanks for pointing that out! Any suggestions on what we should check?
Can we just use the health API?
I was checking lines 2407 to 2432 in 91f1664.
I think in a k8s rolling upgrade, we would be interested in whether the current PD instance has been keeping up with the leader. What about also checking the applied index of the current PD against the leader's committed index?
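A sketch of that idea, again going through the etcd Status API rather than PD's internals: the leader's RaftIndex can stand in for the committed index, and the local member's RaftAppliedIndex shows how far it has applied. The maxLag threshold and function names here are made up for illustration.

```go
package readiness

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// maxLag is an arbitrary threshold for how far the local applied index may
// trail the leader's committed index before we report "not ready"; the
// value is a placeholder and would need tuning.
const maxLag uint64 = 100

// caughtUpWithLeader compares the local member's applied raft index with
// the leader's committed raft index, both read through the etcd Status API.
func caughtUpWithLeader(ctx context.Context, cli *clientv3.Client, selfEndpoint, leaderEndpoint string) (bool, error) {
	self, err := cli.Status(ctx, selfEndpoint)
	if err != nil {
		return false, err
	}
	leader, err := cli.Status(ctx, leaderEndpoint)
	if err != nil {
		return false, err
	}
	// RaftIndex on the leader is its latest committed index;
	// RaftAppliedIndex on the local member is how far it has applied.
	if self.RaftAppliedIndex >= leader.RaftIndex {
		return true, nil
	}
	return leader.RaftIndex-self.RaftAppliedIndex <= maxLag, nil
}
```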
Also, regarding line 73 in 01b8f34: in k8s we typically need the endpoint to return a status code of 400 or greater when a readiness probe check fails (per the k8s docs, any code in the 200–399 range indicates success and any other code indicates failure).
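For example, the handler could map the readiness result onto the status codes the kubelet expects; the /ready path and the readyCheck stub below are hypothetical, just to show the 200-vs-503 behavior:

```go
package main

import (
	"log"
	"net/http"
)

// readyCheck stands in for whichever checks the endpoint ends up doing
// (learner state, raft index lag, etc.); purely a placeholder type.
type readyCheck func(r *http.Request) bool

// readinessHandler returns 200 when the instance is ready and 503
// otherwise, so an HTTP readiness probe fails (status >= 400) until the
// instance has caught up.
func readinessHandler(check readyCheck) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !check(r) {
			http.Error(w, "not ready", http.StatusServiceUnavailable) // 503
			return
		}
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	})
}

func main() {
	// Hypothetical wiring; PD would register this on its existing HTTP mux
	// and back it with real checks instead of this always-ready stub.
	http.Handle("/ready", readinessHandler(func(*http.Request) bool { return true }))
	log.Fatal(http.ListenAndServe(":2379", nil))
}
```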
Feature Request
Describe your feature request related problem
When doing a rolling update of a PD cluster, we need to take down a PD instance, bring up a PD instance with the updated config, and repeat this process for all PD instances one at a time. Typically we use tidb-operator to automate this, and a PD instance is considered "Running" as long as its pod has been up for some duration. However, the PD instance may still be catching up with the leader and receiving etcd raft logs, and so cannot serve requests at that moment. Say we have a PD deployment with 3 replicas: if the first two pods have been updated but are not yet ready to serve new requests, then PD is unavailable at that point, since the remaining PD cannot form a write quorum (the other two are still syncing etcd data).
Describe the feature you'd like
It would be helpful if a PD instance could expose (maybe via a RESTful endpoint) whether its internal data is synced with the leader and it is ready to serve new requests.
Describe alternatives you've considered
An alternative approach in the tidb-operator scenario is to add an init container to PD's pod spec that sleeps for a certain amount of time. Until the sleep finishes, the pod stays in the Init state, so the rolling update process waits for the sleep to complete. Hopefully, during this time, the etcd data for that PD instance gets synced.
Teachability, Documentation, Adoption, Migration Strategy
The endpoint would be something new, and since no existing system depends on it, we don't need to worry about migration.