ebs_br: allow temporary TiKV unreachable during starting snapshot backup #49154
Conversation
Signed-off-by: hillium <[email protected]>
Signed-off-by: hillium <[email protected]>
Signed-off-by: hillium <[email protected]>
Signed-off-by: hillium <[email protected]>
Signed-off-by: hillium <[email protected]>
Hi @YuJuncen. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with `/ok-to-test` in this repo. I understand the commands that are listed here.
Signed-off-by: hillium <[email protected]>
Signed-off-by: hillium <[email protected]>
With this change, what is the maximum time the evict-leader scheduler (and other schedulers) can remain paused? If a TiKV has already restarted, the init pod will wait for it to come up and then suspend Lightning. And during that time, schedulers, GC, and imports all stay paused, right?
If every request to suspend Lightning fails immediately, we will keep retrying for about 10 minutes. For calls that get stuck, we may spend even more time than that. If the GC pause time is critical, perhaps we can base the retry on the time spent on failed requests instead of on the failure count.
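(For illustration only: a minimal Go sketch of such a time-cost-based retry. The `suspendLightning` callback, the 5-second pause between attempts, and the overall shape are assumptions for this sketch, not the actual BR code.)

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// suspendWithTimeBudget keeps retrying until the total elapsed time exceeds
// the budget, instead of counting failures. suspendLightning is a
// hypothetical stand-in for the real "suspend Lightning" request.
func suspendWithTimeBudget(ctx context.Context, budget time.Duration,
	suspendLightning func(context.Context) error) error {
	deadline := time.Now().Add(budget)
	var lastErr error
	for time.Now().Before(deadline) {
		if lastErr = suspendLightning(ctx); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second): // wait a bit before the next attempt
		}
	}
	return fmt.Errorf("suspending lightning did not succeed within %v: %w", budget, lastErr)
}
```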
Yes.
/retest
Signed-off-by: hillium <[email protected]>
Codecov Report
@@             Coverage Diff              @@
##            master     #49154         +/-   ##
=================================================
- Coverage   71.8223%   53.7478%   -18.0745%
=================================================
  Files          1444       1549        +105
  Lines        346984     583242     +236258
=================================================
+ Hits         249212     313480      +64268
- Misses        77425     245816     +168391
- Partials      20347      23946       +3599

Flags with carried forward coverage won't be shown.
Yeah. In fact, can we make the maximum pause (GC/schedulers/import) duration configurable? We don't want to pause them for more than X (let's say 10 minutes), and if we cannot pause all TiKVs within that time, it is OK to fail the backup. A retry based on a time limit would be ideal, but for the implementation it is fine to use retries with exponential backoff and break out after a certain time (see the sketch below). A bit outside the scope of this PR, but do we have retries around the EBS snapshot trigger (done within the backup pod)? If the create-snapshot API gets throttled, what is the maximum retry time? Asking since during that time the init pod (and hence the pause) will still be active. OK, taking this discussion offline (on Slack).
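(A rough sketch of this suggestion in Go: exponential backoff that aborts once a configurable pause budget is exhausted. `pauseAllStores`, `pauseOneStore`, `maxPauseDuration`, and the backoff constants are illustrative names and values, not real BR/Operator configuration.)

```go
package main

import (
	"context"
	"time"
)

// pauseAllStores pauses GC/schedulers/imports on every store with exponential
// backoff between attempts, and gives up once maxPauseDuration has elapsed, so
// the cluster is never left half-paused longer than the configured budget.
// pauseOneStore is a hypothetical per-store pause call.
func pauseAllStores(ctx context.Context, stores []string,
	maxPauseDuration time.Duration,
	pauseOneStore func(context.Context, string) error) error {
	ctx, cancel := context.WithTimeout(ctx, maxPauseDuration)
	defer cancel()

	for _, store := range stores {
		backoff := 500 * time.Millisecond
		for {
			if err := pauseOneStore(ctx, store); err == nil {
				break
			}
			select {
			case <-ctx.Done():
				// Budget exhausted: fail the backup instead of pausing longer.
				return ctx.Err()
			case <-time.After(backoff):
			}
			backoff *= 2
			if backoff > 30*time.Second {
				backoff = 30 * time.Second // cap a single wait at 30s
			}
		}
	}
	return nil
}
```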
Signed-off-by: hillium <[email protected]>
/retest
@BornChanger: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message.
/retest-required
@BornChanger: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message.
/test check-dev
@YuJuncen: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message.
This reverts commit ccdd36b.
Signed-off-by: hillium <[email protected]>
/retest-required
Signed-off-by: hillium <[email protected]>
Signed-off-by: hillium <[email protected]>
In response to a cherrypick label: new pull request created to branch
Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: ti-chi-bot <[email protected]>
In response to a cherrypick label: new pull request created to branch
In response to a cherrypick label: new pull request created to branch
Signed-off-by: ti-chi-bot <[email protected]>
ebs_br: allow temporary TiKV unreachable during starting snapshot backup (pingcap#49154) (pingcap#50444) (pingcap#37)
close pingcap#49152, close pingcap#49153
Co-authored-by: Ti Chi Robot <[email protected]>
In response to a cherrypick label: new pull request could not be created: failed to create pull request against pingcap/tidb#release-7.1 from head ti-chi-bot:cherry-pick-49154-to-release-7.1: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for ti-chi-bot:cherry-pick-49154-to-release-7.1."}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#create-a-pull-request"}
Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: ti-chi-bot <[email protected]>
In response to a cherrypick label: new pull request created to branch
What problem does this PR solve?
Issue Number: close #49152, close #49153
Problem Summary:
See the issue.
For #49152, we didn't add a retry when starting to suspend Lightning.
For #49153, we simply broke out of the loop when the keeper encountered an error; this may let the final consistency check pass because of the lease-extension request.
What changed and how does it work?
Fixed the problems above by adding retries and failing fast.
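For illustration only, a minimal Go sketch of the "fail fast" part, assuming a hypothetical `keepLease` lease-extension call (the name does not come from this PR): the keeper stops and reports the first error to its caller instead of breaking the loop and continuing silently.

```go
package main

import (
	"context"
	"time"
)

// runLeaseKeeper periodically extends the suspension lease. Instead of
// breaking the loop and swallowing the error (which could let the final
// consistency check pass even though the lease was not really kept), it
// reports the first failure to the caller and stops. keepLease is a
// hypothetical lease-extension call.
func runLeaseKeeper(ctx context.Context, interval time.Duration,
	keepLease func(context.Context) error) <-chan error {
	errCh := make(chan error, 1)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				errCh <- ctx.Err()
				return
			case <-ticker.C:
				if err := keepLease(ctx); err != nil {
					errCh <- err // fail fast: surface the error to the caller
					return
				}
			}
		}
	}()
	return errCh
}
```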
Check List
Tests
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.