Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HBASE-26798 improve balancer visibility #9

Open
wants to merge 4 commits into
base: hubspot-2
Choose a base branch
from

Conversation

briaugenreich
Copy link

@briaugenreich briaugenreich commented Mar 15, 2022

Improve our ability to understand how and why the balancer is/isnt balancing:

  1. Report new metrics that match the values in the log output
  2. Add existing metrics to the log statements
  3. Add additional logging to report when the move cost is greatly impacting the ability for the balancer to take action

@briaugenreich
Copy link
Author

Currently the following is shared in the logs .

 Cluster wide - skipping load balancing because weighted average imbalance=0.0018143527755874955 <= 
threshold(0.025). If you want more aggressive balancing, either lower 
hbase.master.balancer.stochastic.minCostNeedBalance from 0.025 or increase the relative multiplier(s) of the specific cost 
function(s). functionCost=RegionCountSkewCostFunction : (multiplier=500.0, imbalance=0.0, balanced); 
PrimaryRegionCountSkewCostFunction : (not needed); MoveCostFunction : (multiplier=100.0, imbalance=0.0, balanced); 
ServerLocalityCostFunction : (multiplier=25.0, imbalance=0.0, balanced); TableSkewCostFunction : (multiplier=35.0,
 imbalance=0.0, balanced); RegionReplicaHostCostFunction : (not needed); RegionReplicaRackCostFunction : (not needed);
 ReadRequestCostFunction : (multiplier=10.0, imbalance=0.0774546308482926); WriteRequestCostFunction : 
(multiplier=5.0, imbalance=0.047481522296745335); MemStoreSizeCostFunction : (multiplier=5.0, imbalance=0.0, 
balanced); StoreFileCostFunction : (multiplier=5.0, imbalance=0.04617554626215634);

And the following metrics are reported to jmx (and shared in our dashboards).
image
image
image

While we know the metrics and log out put relate via the equations below, however lining up the correct log lines and data points and performing a bunch of calculations is not very efficient or effective. My next comment will present how this is improved by these changes.

  1. The overall cost reported in the metrics is defined below. This metric is NOT the same as the overall weighted_imbalance the hbase balancer reports in the logs and uses to monitor the state of imbalance. But this can be calculate from each of the reported costs and multipliers in the logs.

    overall = Σ cost * multiplier

  2. The individual costs reported in the metrics are defined below. They in-fact are not the costs but the calculated percentage each cost function is driving up the overall cost.

percent cost = (multiplier * cost )/ overall *100

  1. Overall imbalance cost printed in the logs is defined below. This is the cost hbase uses to take action/compute changes on a cluster (or table).

logs_overall_cost = overall / (Σmultipliers)

@briaugenreich
Copy link
Author

Here's a visual on how these changes play out

  • its clear that the cost values reported for each cost function represents the percentage that cost makes up the overall cost (vs being confused that it doesnt match whats printed in the logs)
  • its also clear that the overall cost is different from the overall imbalance score (overall cost/sum of multipliers) . and we can see how
  • then we will need to update the collectd metrics to add the overall imbalance to our dashboards for hb2

image

hmaster-artemis-hb2-a-1-54d7955cb6-6zk99 hmaster 2022-03-15 18:58:00,423 [master/hmaster-artemis-hb2-a-1-
54d7955cb6-6zk99:60000.Chore.1] INFO org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: 
Cluster wide - skipping load balancing because weighted average imbalance=0.005019418895810577 <= threshold(0.025).
 If you want more aggressive balancing, either lower hbase.master.balancer.stochastic.minCostNeedBalance from 0.025 or 
increase the relative multiplier(s) of the specific cost function(s). functionCost=RegionCountSkewCostFunction : 
(multiplier=500.0, imbalance=0.0); PrimaryRegionCountSkewCostFunction : (not needed); MoveCostFunction : 
(multiplier=100.0, imbalance=0.0); ServerLocalityCostFunction : (multiplier=25.0, imbalance=0.0); TableSkewCostFunction : 
(multiplier=35.0, imbalance=0.0); RegionReplicaHostCostFunction : (not needed); RegionReplicaRackCostFunction : (not 
needed); ReadRequestCostFunction : (multiplier=10.0, imbalance=0.17310961637962888, need balance); 
WriteRequestCostFunction : (multiplier=5.0, imbalance=0.018860510805500982); MemStoreSizeCostFunction : 
(multiplier=5.0, imbalance=0.0); StoreFileCostFunction : (multiplier=5.0, imbalance=0.3225806451612903, need balance);

@briaugenreich
Copy link
Author

Ive omitted testing here since these changes are focused on logs + adding metrics to existing reporting.

@briaugenreich briaugenreich force-pushed the briaugenreich_HBASE-26798_hubspot-2 branch from 686b25b to 3c1b98f Compare March 16, 2022 13:58
Copy link
Member

@bbeaudreault bbeaudreault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this up! Your description does clarify a lot of the confusion, and this definitely seems like an improvement. I added a few small comments.

My only question is whether there's anything we can do to verify that this change will hopefully be the last such change we need (at least for a while). In HubSpot we often move iteratively, pushing small changes over time. When upstreaming things it is costly to take this approach so ideally each pushed solution is relatively complete. I'm not saying this solution is incomplete, but I just don't know.. so want to see if we can maybe show how these changes will help us better tune the balancer going forward.

Also, the TestStochasticBalancerJmxMetrics has some assertions that check for the specific outputted metrics. We should probably add the new "Imbalance" there, right? I'm also surprised that the test doesn't failure on the Percentage change.

@briaugenreich
Copy link
Author

When upstreaming things it is costly to take this approach so ideally each pushed solution is relatively complete.

This makes sense too. Would your approach be to deploy this in house and spend a little time watching and waiting?

@briaugenreich
Copy link
Author

Also, the TestStochasticBalancerJmxMetrics has some assertions that check for the specific outputted metrics. We should probably add the new "Imbalance" there, right? I'm also surprised that the test doesn't failure on the Percentage change.

So i went back an re-ran the tests I did locally and dug into the build output. I noticed a test was skipped? So maybe thats why it was successful?

mvn clean verify -pl hbase-server -am -Dtest=org.apache.hadoop.hbase.master.balancer.TestStochasticBalancerJmxMetrics,org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancer  -DfailIfNoTests=false -Dhadoop.profile=3.0
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hbase.master.balancer.TestStochasticBalancerJmxMetrics
[INFO] Running org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancer
[WARNING] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.041 s - in org.apache.hadoop.hbase.master.balancer.TestStochasticBalancerJmxMetrics
[INFO] Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 134.914 s - in org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancer
[INFO] 
[INFO] Results:
[INFO] 
[WARNING] Tests run: 16, Failures: 0, Errors: 0, Skipped: 1
[INFO] 

Ill dig into how exactly that is working - and add the imbalance check !

@bbeaudreault
Copy link
Member

Ah I just noticed that the test is annotated with @Ignore, ever since https://issues.apache.org/jira/browse/HBASE-15348

That was years ago, so might be worth removing that to see if it works now. If not maybe there's a small change we can do to make it work, but also don't want to send you down a rabbit hole on that.

This makes sense too. Would your approach be to deploy this in house and spend a little time watching and waiting?

Yea we could do that. We also could, as an exercise, look at some of the recent balancer alerts and see how the new imbalance metric and/or logs may have sped up our analysis. Or maybe more importantly, is there anything else in the last few alerts you've dug into that you wish we had better visibility into?

@briaugenreich
Copy link
Author

briaugenreich commented Mar 17, 2022

I got the tests to work! Except I had to add a pause to allow the balancer to run (Computation took 778 ms to try 19200 different iterations. ) and then publish the metrics.

[INFO] 
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hbase.master.balancer.TestStochasticBalancerJmxMetrics
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 31.123 s - in org.apache.hadoop.hbase.master.balancer.TestStochasticBalancerJmxMetrics
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0
[INFO] 

The tests also allowed me to see the moveCost logs - which atm they feel like too much

There are several hundred (at least) of these:

2022-03-17 11:46:31,677 INFO  [Time-limited test] balancer.StochasticLoadBalancer(621): MoveCostFunction greatly impacting overall cost of improvement plan. currentCost=2.333333333333333 percentageOfCost=1.0. Consider lowering moveCost multiplier.
2022-03-17 11:46:31,677 INFO  [Time-limited test] balancer.StochasticLoadBalancer(621): MoveCostFunction greatly impacting overall cost of improvement plan. currentCost=2.333333333333333 percentageOfCost=1.0. Consider lowering moveCost multiplier.
2022-03-17 11:46:31,677 INFO  [Time-limited test] balancer.StochasticLoadBalancer(621): MoveCostFunction greatly impacting overall cost of improvement plan. currentCost=2.333333333333333 percentageOfCost=1.0. Consider lowering moveCost multiplier.
2022-03-17 11:46:31,677 INFO  [Time-limited test] balancer.StochasticLoadBalancer(621): MoveCostFunction greatly impacting overall cost of improvement plan. currentCost=2.333333333333333 percentageOfCost=1.0. Consider lowering moveCost multiplier.
2022-03-17 11:46:31,677 INFO  [Time-limited test] balancer.StochasticLoadBalancer(621): MoveCostFunction greatly impacting overall cost of improvement plan. currentCost=2.333333333333333 percentageOfCost=1.0. Consider lowering moveCost multiplier.
2022-03-17 11:46:31,677 INFO  [Time-limited test] balancer.StochasticLoadBalancer(621): MoveCostFunction greatly impacting overall cost of improvement plan. currentCost=3.5 percentageOfCost=1.0. Consider lowering moveCost multiplier.
2022-03-17 11:46:31,677 INFO  [Time-limited test] balancer.StochasticLoadBalancer(621): MoveCostFunction greatly impacting overall cost of improvement plan. currentCost=2.333333333333333 percentageOfCost=1.0. Consider lowering moveCost multiplier.

@bbeaudreault
Copy link
Member

Good catch on the new moveCost logs. Wonder if we can make it just log once at the start? Glad to see you could get the tests working again!

I don't follow regarding the pause -- i thought the balanceTable call was synchronous? I think a pause might cause issues upstream because pauses often result in flaky tests. When the tests are running all the time in the upstream CI system sometimes system load can cause things to run slower than expected. So 5s might work locally, but might fail sometimes in the CI system.

Is there any sort of state we can wait for instead of a pause? Or can we directly call something to trigger the metrics early/on-demand? Sometimes the devs will mock things or pass in like a CountDownLatch or other forms of flow control, but I don't have a great grasp on how helpful/possible those would be in this case.

@briaugenreich
Copy link
Author

Thanks for the feedback.

Ive update the move cost to print at most 5 times, though if one time is enough Im fine with that too.
I also reworked the tests to utilize the existing retry functionality to ensure we get more than the tag.Context and tag.Hostname metrics.

@briaugenreich
Copy link
Author

Still had some tweaks on the tests, but was able to get them successfully work. I think there is some latency between when the metrics are shipped and consumable (seems like this is expected given the existing retry logic ). I added an additional check to make sure we got the correct number of metrics we expect.

Once i get final feedback from you on these changes (and the move cost log limit ) Ill squash these into one commit. :)

final int count = 0;
for (int i = 0; i < 10; i++) {
for (int i = 0; i < 20; i++) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just realized in another PR the count used on line 214 is never updated... TODO to fix.

@briaugenreich
Copy link
Author

briaugenreich commented May 25, 2022

Circling back on this. I think with my better understanding of the balancer code some of these changes are just nice to have so that you dont have to read through the code to see exactly whats being reported in the logs/jmx endpoint. But I do think having the weighted imbalance added to the list of reported metrics could be really beneficial for folks since they can use this as a direct reflection of balance/imbalance.

@bbeaudreault Let me know your thoughts on this PR and if it should remain open or if is should submit upstream . If you are good with this I'll squash into one commit prior to submitting upstream.

Copy link
Member

@bbeaudreault bbeaudreault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I defer to you in terms of what might be most helpful -- you've done a lot more analysis of the balancer skews in recent history.

if what you have here could be useful for others, then definitely open to submitting it upstream. had a couple comments here, but otherwise can submit that whenever you'd like. we can scope down the associated jira if necessary

@@ -132,6 +132,7 @@ public class StochasticLoadBalancer extends BaseLoadBalancer {
protected static final String COST_FUNCTIONS_COST_FUNCTIONS_KEY =
"hbase.master.balancer.stochastic.additionalCostFunctions";
public static final String OVERALL_COST_FUNCTION_NAME = "Overall";
public static final String WIGHTED_IMBALANCE_COST_NAME = "Imbalance";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo "WIGHTED"

@@ -480,7 +481,7 @@ public synchronized List<RegionPlan> balanceTable(TableName tableName, Map<Serve
double currentCost = computeCost(cluster, Double.MAX_VALUE);
curOverallCost = currentCost;
System.arraycopy(tempFunctionCosts, 0, curFunctionCosts, 0, curFunctionCosts.length);
updateStochasticCosts(tableName, curOverallCost, curFunctionCosts);
updateStochasticCosts(tableName, curOverallCost, curOverallCost / sumMultiplier, curFunctionCosts);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: so we're already passing in curOverallCost to updateStochasticCosts. we call updateStochasticCosts in 3 places and in each place we have to calculate curOverallCost / sumMultiplier. rather than pass in that value, what if we just pass in sumMultiplier and do the calculation within the method impl? just a little more DRY

@@ -603,10 +611,30 @@ private void sendRegionPlansToRingBuffer(List<RegionPlan> plans, double currentC
}
}

private void logMoveCosts(double currentCost, double [] currentFunctionCosts, int logCount){
if (logCount > 5){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does printing this 5 times provide us any benefit over printing it once?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants