Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

api: Add the readiness endpoint for PD #5685

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

hanlins
Copy link

@hanlins hanlins commented Nov 8, 2022

What problem does this PR solve?

Added a readiness endpoint that exposes the information whether the corresponding etcd member of the PD is synced with the leader.

Issue Number: Close #5658

What is changed and how does it work?

Add the readiness endpoint for PD

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

Release note

Add a readiness check endpoint for checking etcd readiness status

@ti-chi-bot
Copy link
Member

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Nov 8, 2022
@ti-chi-bot ti-chi-bot requested review from JmPotato and Yisaer November 8, 2022 03:12
@hanlins hanlins marked this pull request as draft November 8, 2022 03:13
@ti-chi-bot ti-chi-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 8, 2022
@nolouch nolouch requested review from disksing and rleungx and removed request for disksing and Yisaer November 8, 2022 03:18
@nolouch
Copy link
Contributor

nolouch commented Nov 8, 2022

I think once member down, the peer role still a follower, do not change to the learner.

@hanlins
Copy link
Author

hanlins commented Nov 8, 2022

I think once member down, the peer role still a follower, do not change to the learner.

Hi @nolouch, thanks for the feedback. Added a check for the lag between the leader's committed index and the current server's applied index, if the lag is larger than 5000, then the readiness check fails. What about that?

cc @rleungx

@codecov
Copy link

codecov bot commented Nov 9, 2022

Codecov Report

Base: 75.72% // Head: 75.57% // Decreases project coverage by -0.16% ⚠️

Coverage data is based on head (652b70f) compared to base (c55e632).
Patch coverage: 57.69% of modified lines in pull request are covered.

❗ Current head 652b70f differs from pull request most recent head 277af51. Consider uploading reports for the commit 277af51 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5685      +/-   ##
==========================================
- Coverage   75.72%   75.57%   -0.16%     
==========================================
  Files         342      330      -12     
  Lines       34828    32986    -1842     
==========================================
- Hits        26373    24928    -1445     
+ Misses       6213     5906     -307     
+ Partials     2242     2152      -90     
Flag Coverage Δ
unittests 75.57% <57.69%> (-0.16%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/api/etcd.go 54.16% <54.16%> (ø)
server/api/router.go 97.97% <100.00%> (+0.01%) ⬆️
server/api/diagnostic.go 76.92% <0.00%> (-23.08%) ⬇️
server/util.go 40.90% <0.00%> (-15.70%) ⬇️
server/keyspace/util.go 85.71% <0.00%> (-9.53%) ⬇️
server/tso/local_allocator.go 64.86% <0.00%> (-6.76%) ⬇️
server/schedule/hbstream/heartbeat_streams.go 72.72% <0.00%> (-6.07%) ⬇️
server/keyspace_service.go 55.55% <0.00%> (-5.64%) ⬇️
server/schedulers/transfer_witness_leader.go 76.08% <0.00%> (-5.31%) ⬇️
server/region_syncer/client.go 81.34% <0.00%> (-5.23%) ⬇️
... and 227 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@hanlins hanlins force-pushed the topic/5658/readiness-endpoint branch from 727e145 to c0f6996 Compare November 10, 2022 01:49
@hanlins
Copy link
Author

hanlins commented Nov 10, 2022

Some updates on the issue. I've updated the response payload schema to include richer information. Set up a TiDB cluster using the patch and it seems the response we got from PD is expected:

/ # ./curl-amd64 -v 127.0.0.1:2379/pd/api/v1/ready
*   Trying 127.0.0.1:2379...
* Connected to 127.0.0.1 (127.0.0.1) port 2379 (#0)
> GET /pd/api/v1/ready HTTP/1.1
> Host: 127.0.0.1:2379
> User-Agent: curl/7.86.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Access-Control-Allow-Headers: accept, content-type, authorization
< Access-Control-Allow-Methods: POST, GET, OPTIONS, PUT, DELETE
< Access-Control-Allow-Origin: *
< Content-Type: application/json; charset=UTF-8
< Date: Thu, 10 Nov 2022 21:23:57 GMT
< Content-Length: 70
<
{
  "is_learner": false,
  "curr_index": 610,
  "leader_index": 610
}
* Connection #0 to host 127.0.0.1 left intact

For my next steps, I'm going to work together with some of our users and see if it can resolve issues in their scenario, and will also update the test cases in parallel.

@hanlins
Copy link
Author

hanlins commented Nov 18, 2022

Added some basic tests and seems working:

➜  pd git:(topic/5658/readiness-endpoint) go test -v github.com/tikv/pd/server/api -run Etcd
=== RUN   TestEtcdTestSuite
...
=== RUN   TestEtcdTestSuite/TestEtcdReady
    etcd_test.go:58: response: '{IsLearner:false CurrIndex:28 LeaderIndex:28}'
    etcd_test.go:58: response: '{IsLearner:false CurrIndex:28 LeaderIndex:28}'
    etcd_test.go:58: response: '{IsLearner:false CurrIndex:29 LeaderIndex:29}'
...
[2022/11/18 20:42:13.501 +00:00] [INFO] [etcd.go:571] ["stopped serving peer traffic"] [address=127.0.0.1:33701]
[2022/11/18 20:42:13.501 +00:00] [INFO] [etcd.go:373] ["closed etcd server"] [name=pd1] [data-dir=/tmp/test_pd3505621251] [advertise-peer-urls="[http://127.0.0.1:33701]"] [advertise-client-urls="[http://127.0.0.1:44639]"]
[2022/11/18 20:42:13.502 +00:00] [INFO] [server.go:522] ["close server"]
--- PASS: TestEtcdTestSuite (13.55s)
    --- PASS: TestEtcdTestSuite/TestEtcdReady (0.16s)
PASS
ok  	github.com/tikv/pd/server/api	13.584s

Wondering do you have any concerns about the endpoint naming? The meaning for /ready could be a little vague, maybe we should rename it as /etcd/ready? Or just leave it as /ready for simplicity?

@hanlins hanlins marked this pull request as ready for review November 19, 2022 04:10
@ti-chi-bot ti-chi-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 19, 2022
@wuhuizuo
Copy link
Contributor

/retest

@ti-chi-bot
Copy link
Member

@wuhuizuo: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note Denotes a PR that will be considered when it comes time to generate release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Need readiness check endpoint for PD
4 participants