-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support specifying expected primary in ERS/PRS #16852
Support specifying expected primary in ERS/PRS #16852
Conversation
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #16852 +/- ##
========================================
Coverage 69.43% 69.43%
========================================
Files 1571 1571
Lines 203021 203223 +202
========================================
+ Hits 140970 141117 +147
- Misses 62051 62106 +55 ☔ View full report in Codecov by Sentry. |
You'll want to add support for this in the client command, no? Here: go/cmd/vtctldclient/command/reparents.go |
@mattlord oh thanks, yes I'll add support for the CLI binary 👍👀 |
Added: tvaillancourt@tvailla-ltmxctu vitess % ./bin/vtctldclient EmergencyReparentShard --help 2>&1 | grep expected-primary
--expected-primary string Alias of a tablet that must be the current primary in order for the reparent to be processed. 👍 |
Signed-off-by: Tim Vaillancourt <[email protected]>
@timvaillancourt Can we have that same cli flag for |
@arthurschreiber in the original issue we discussed adding this to PRS too but the recommendation was to use the new primary flag to fence things. Of course that only works if all systems agree on a candidate, which isn't guaranteed I'd be on board with adding this to PRS too if no objections |
Adding this to PRS would make our operations a lot more robust. Obviously, PRS takes a lock before modifying a shard, so no two PRS operations can run at the same time, but we'd also like to guarantee that we don't fail over a shard during the buffer cooldown period we have set. We can check the time when the current primary started serving, but we can't guarantee that a PRS from some other automation or manual call jumps in between the check of the timestamp and the call to PRS. Allowing us to specify the expected current primary for PRS as well would allow us to abort if the primary got changed unexpectedly. |
I think the idea of being able to enforce a specific sequence or state machine for the shard leadership/primaryship makes a lot of sense for both ERS and PRS, FWIW. 🙂 |
@arthurschreiber / @mattlord yeah, I think it's a good idea for both so I'll update this PR to add that As mentioned, Vitess could use this internally too (probably in +1 release). It would never hit a conflict (unless we broke something) but it wouldn't hurt either EDIT: I also realized |
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
I think I prefer the field-name 🎉 for Or other suggestions appreciated |
Signed-off-by: Tim Vaillancourt <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Once the flag name is final, you should edit the PR description to document command usage. I've added a label that prevents merge, and you can remove the label and merge it at that point.
Sounds good! It looks like there weren't many strong opinions on naming I think I've flip/flopped back to leaving this as-is. My reasoning is if I say what I'm asking to do out loud it is "do X reparent and I expect Y to be still be the primary". And I feel adding |
Description
This PR allows a
EmergencyReparentShard
/PlannedReparentShard
request to contain the alias of the primary we expect to be the current primary for the action to succeed. If this alias is incorrect or stale, the reparent fails with an error indicating there is a failed precondition/mismatch.This is useful to prevent races between external automation that runs reparents and Vitess (
vtorc
andvtctld
), which may reparent a shard in the time external automation is creating/processing it's own reparent request. With this support added, automation with a stale view of the world will encounter an error instead of triggering an potentially unnecessary reparentIn the future, Vitess could use this field internally to be more explicit but because everything uses shard locks, it should never encounter a mismatch
Related Issue(s)
Resolves #16430
Checklist
Deployment Notes