ebs_br: allow temporary TiKV unreachable during starting snapshot backup #49154
Conversation
Signed-off-by: hillium <[email protected]>
Signed-off-by: hillium <[email protected]>
Signed-off-by: hillium <[email protected]>
Signed-off-by: hillium <[email protected]>
Signed-off-by: hillium <[email protected]>
Hi @YuJuncen. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with `/ok-to-test` in this repo. I understand the commands that are listed here.
Signed-off-by: hillium <[email protected]>
Signed-off-by: hillium <[email protected]>
With this change, what is the maximum time the evict-leader scheduler (and other schedulers) can remain paused? If a TiKV has already restarted, the init pod will wait for it to come up and then suspend Lightning. And during that time, schedulers, GC, and imports all stay paused, right?
If every request to suspend Lightning fails immediately, we will keep retrying for about 10 minutes. For calls that get stuck, we may spend even more time than that. If the GC pause time is critical, perhaps we can base the retry on the time spent on failed requests instead of on the failure count.
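(For illustration only: a minimal Go sketch of such a time-cost-based retry. The `suspendLightning` callback, the 5-second pause between attempts, and the overall shape are assumptions for this sketch, not the actual BR code.)

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// suspendWithTimeBudget keeps retrying until the total elapsed time exceeds
// the budget, instead of counting failures. suspendLightning is a
// hypothetical stand-in for the real "suspend Lightning" request.
func suspendWithTimeBudget(ctx context.Context, budget time.Duration,
	suspendLightning func(context.Context) error) error {
	deadline := time.Now().Add(budget)
	var lastErr error
	for time.Now().Before(deadline) {
		if lastErr = suspendLightning(ctx); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second): // wait a bit before the next attempt
		}
	}
	return fmt.Errorf("suspending lightning did not succeed within %v: %w", budget, lastErr)
}
```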
Yes.
/retest
Signed-off-by: hillium <[email protected]>
Codecov Report
@@             Coverage Diff              @@
##            master     #49154         +/-   ##
=================================================
- Coverage   71.8223%   53.7478%   -18.0745%
=================================================
  Files          1444       1549        +105
  Lines        346984     583242     +236258
=================================================
+ Hits         249212     313480      +64268
- Misses        77425     245816     +168391
- Partials      20347      23946       +3599

Flags with carried forward coverage won't be shown.
Yeah. In fact, can we make the maximum pause (GC/schedulers/import) duration configurable? We don't want to pause them for more than X (let's say 10 minutes), and if we cannot pause all TiKVs within that time, it is OK to fail the backup. A retry based on a time limit would be ideal, but for the implementation it is fine to use retries with exponential backoff and break out after a certain time (see the sketch below). A bit outside the scope of this PR, but do we have retries around the EBS snapshot trigger (done within the backup pod)? If the create-snapshot API gets throttled, what is the maximum retry time? Asking since during that time the init pod (and hence the pause) will still be active. OK, taking this discussion offline (on Slack).
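(A rough sketch of this suggestion in Go: exponential backoff that aborts once a configurable pause budget is exhausted. `pauseAllStores`, `pauseOneStore`, `maxPauseDuration`, and the backoff constants are illustrative names and values, not real BR/Operator configuration.)

```go
package main

import (
	"context"
	"time"
)

// pauseAllStores pauses GC/schedulers/imports on every store with exponential
// backoff between attempts, and gives up once maxPauseDuration has elapsed, so
// the cluster is never left half-paused longer than the configured budget.
// pauseOneStore is a hypothetical per-store pause call.
func pauseAllStores(ctx context.Context, stores []string,
	maxPauseDuration time.Duration,
	pauseOneStore func(context.Context, string) error) error {
	ctx, cancel := context.WithTimeout(ctx, maxPauseDuration)
	defer cancel()

	for _, store := range stores {
		backoff := 500 * time.Millisecond
		for {
			if err := pauseOneStore(ctx, store); err == nil {
				break
			}
			select {
			case <-ctx.Done():
				// Budget exhausted: fail the backup instead of pausing longer.
				return ctx.Err()
			case <-time.After(backoff):
			}
			backoff *= 2
			if backoff > 30*time.Second {
				backoff = 30 * time.Second // cap a single wait at 30s
			}
		}
	}
	return nil
}
```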
Signed-off-by: hillium <[email protected]>
/retest
@BornChanger: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message.
/retest-required
@BornChanger: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message.
/test check-dev
@YuJuncen: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message.
This reverts commit ccdd36b.
Signed-off-by: hillium <[email protected]>
/retest-required
Signed-off-by: hillium <[email protected]>
Signed-off-by: hillium <[email protected]>
In response to a cherrypick label: new pull request created to branch
Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: ti-chi-bot <[email protected]>
In response to a cherrypick label: new pull request created to branch
In response to a cherrypick label: new pull request created to branch
Signed-off-by: ti-chi-bot <[email protected]>
ebs_br: allow temporary TiKV unreachable during starting snapshot backup (pingcap#49154) (pingcap#50444) (pingcap#37)
close pingcap#49152, close pingcap#49153
Co-authored-by: Ti Chi Robot <[email protected]>
In response to a cherrypick label: new pull request could not be created: failed to create pull request against pingcap/tidb#release-7.1 from head ti-chi-bot:cherry-pick-49154-to-release-7.1: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for ti-chi-bot:cherry-pick-49154-to-release-7.1."}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#create-a-pull-request"}
Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: ti-chi-bot <[email protected]>
In response to a cherrypick label: new pull request created to branch
What problem does this PR solve?
Issue Number: close #49152, close #49153
Problem Summary:
See the issue.
For #49152, we didn't add a retry when starting to suspend Lightning.
For #49153, we simply broke out of the loop when the keeper encountered an error; this may let the final consistency check pass because of the lease-extension request.
What changed and how does it work?
Fixed the problems above by adding retries and failing fast.
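For illustration only, a minimal Go sketch of the "fail fast" part, assuming a hypothetical `keepLease` lease-extension call (the name does not come from this PR): the keeper stops and reports the first error to its caller instead of breaking the loop and continuing silently.

```go
package main

import (
	"context"
	"time"
)

// runLeaseKeeper periodically extends the suspension lease. Instead of
// breaking the loop and swallowing the error (which could let the final
// consistency check pass even though the lease was not really kept), it
// reports the first failure to the caller and stops. keepLease is a
// hypothetical lease-extension call.
func runLeaseKeeper(ctx context.Context, interval time.Duration,
	keepLease func(context.Context) error) <-chan error {
	errCh := make(chan error, 1)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				errCh <- ctx.Err()
				return
			case <-ticker.C:
				if err := keepLease(ctx); err != nil {
					errCh <- err // fail fast: surface the error to the caller
					return
				}
			}
		}
	}()
	return errCh
}
```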
Check List
Tests
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.