-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-26798 improve balancer visibility #9
base: hubspot-2
Are you sure you want to change the base?
Conversation
Here's a visual on how these changes play out
|
Ive omitted testing here since these changes are focused on logs + adding metrics to existing reporting. |
686b25b
to
3c1b98f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing this up! Your description does clarify a lot of the confusion, and this definitely seems like an improvement. I added a few small comments.
My only question is whether there's anything we can do to verify that this change will hopefully be the last such change we need (at least for a while). In HubSpot we often move iteratively, pushing small changes over time. When upstreaming things it is costly to take this approach so ideally each pushed solution is relatively complete. I'm not saying this solution is incomplete, but I just don't know.. so want to see if we can maybe show how these changes will help us better tune the balancer going forward.
Also, the TestStochasticBalancerJmxMetrics has some assertions that check for the specific outputted metrics. We should probably add the new "Imbalance" there, right? I'm also surprised that the test doesn't failure on the Percentage change.
hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java
Outdated
Show resolved
Hide resolved
This makes sense too. Would your approach be to deploy this in house and spend a little time watching and waiting? |
So i went back an re-ran the tests I did locally and dug into the build output. I noticed a test was skipped? So maybe thats why it was successful?
Ill dig into how exactly that is working - and add the imbalance check ! |
Ah I just noticed that the test is annotated with That was years ago, so might be worth removing that to see if it works now. If not maybe there's a small change we can do to make it work, but also don't want to send you down a rabbit hole on that.
Yea we could do that. We also could, as an exercise, look at some of the recent balancer alerts and see how the new imbalance metric and/or logs may have sped up our analysis. Or maybe more importantly, is there anything else in the last few alerts you've dug into that you wish we had better visibility into? |
I got the tests to work! Except I had to add a pause to allow the balancer to run (
The tests also allowed me to see the moveCost logs - which atm they feel like too much There are several hundred (at least) of these:
|
Good catch on the new moveCost logs. Wonder if we can make it just log once at the start? Glad to see you could get the tests working again! I don't follow regarding the pause -- i thought the balanceTable call was synchronous? I think a pause might cause issues upstream because pauses often result in flaky tests. When the tests are running all the time in the upstream CI system sometimes system load can cause things to run slower than expected. So 5s might work locally, but might fail sometimes in the CI system. Is there any sort of state we can wait for instead of a pause? Or can we directly call something to trigger the metrics early/on-demand? Sometimes the devs will mock things or pass in like a CountDownLatch or other forms of flow control, but I don't have a great grasp on how helpful/possible those would be in this case. |
Thanks for the feedback. Ive update the move cost to print at most 5 times, though if one time is enough Im fine with that too. |
Still had some tweaks on the tests, but was able to get them successfully work. I think there is some latency between when the metrics are shipped and consumable (seems like this is expected given the existing retry logic ). I added an additional check to make sure we got the correct number of metrics we expect. Once i get final feedback from you on these changes (and the move cost log limit ) Ill squash these into one commit. :) |
final int count = 0; | ||
for (int i = 0; i < 10; i++) { | ||
for (int i = 0; i < 20; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just realized in another PR the count used on line 214 is never updated... TODO to fix.
Circling back on this. I think with my better understanding of the balancer code some of these changes are just nice to have so that you dont have to read through the code to see exactly whats being reported in the logs/jmx endpoint. But I do think having the weighted imbalance added to the list of reported metrics could be really beneficial for folks since they can use this as a direct reflection of balance/imbalance. @bbeaudreault Let me know your thoughts on this PR and if it should remain open or if is should submit upstream . If you are good with this I'll squash into one commit prior to submitting upstream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I defer to you in terms of what might be most helpful -- you've done a lot more analysis of the balancer skews in recent history.
if what you have here could be useful for others, then definitely open to submitting it upstream. had a couple comments here, but otherwise can submit that whenever you'd like. we can scope down the associated jira if necessary
@@ -132,6 +132,7 @@ public class StochasticLoadBalancer extends BaseLoadBalancer { | |||
protected static final String COST_FUNCTIONS_COST_FUNCTIONS_KEY = | |||
"hbase.master.balancer.stochastic.additionalCostFunctions"; | |||
public static final String OVERALL_COST_FUNCTION_NAME = "Overall"; | |||
public static final String WIGHTED_IMBALANCE_COST_NAME = "Imbalance"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: typo "WIGHTED"
@@ -480,7 +481,7 @@ public synchronized List<RegionPlan> balanceTable(TableName tableName, Map<Serve | |||
double currentCost = computeCost(cluster, Double.MAX_VALUE); | |||
curOverallCost = currentCost; | |||
System.arraycopy(tempFunctionCosts, 0, curFunctionCosts, 0, curFunctionCosts.length); | |||
updateStochasticCosts(tableName, curOverallCost, curFunctionCosts); | |||
updateStochasticCosts(tableName, curOverallCost, curOverallCost / sumMultiplier, curFunctionCosts); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: so we're already passing in curOverallCost to updateStochasticCosts. we call updateStochasticCosts in 3 places and in each place we have to calculate curOverallCost / sumMultiplier
. rather than pass in that value, what if we just pass in sumMultiplier
and do the calculation within the method impl? just a little more DRY
@@ -603,10 +611,30 @@ private void sendRegionPlansToRingBuffer(List<RegionPlan> plans, double currentC | |||
} | |||
} | |||
|
|||
private void logMoveCosts(double currentCost, double [] currentFunctionCosts, int logCount){ | |||
if (logCount > 5){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does printing this 5 times provide us any benefit over printing it once?
Improve our ability to understand how and why the balancer is/isnt balancing: