ProxLB doesn't think the containers need to be moved #29
Hey @ewenlau, thanks for your bug report. I guess that's based on my assumption that you mostly run more or less equal nodes in a cluster, where CPU and memory are the same. Of course, you're right, and this might not be the case everywhere. But let's have a look at this: the issue here is that the gap between both nodes is too big, and ProxLB therefore tries to place the guest on node01, where it is already placed. You have configured a balanciness of

So, I guess it's more about which metric should be used for memory and disk balancing: the used space (to keep the overall resource percentages more equal) or the free space (to make sure the node with the most available resources is used). I'm happy for some input; the behaviour can be changed or even made a config parameter to make everyone comfortable.

The second one, with the dry-run mode, indeed looks like a bug, but more like a "logging" bug, because there isn't anything to rebalance (the parent and the rebalancing node are the same). But I'll have a look at it tomorrow.

Thanks for your report and the logs!

Cheers,
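For clarity, here is a minimal sketch of that threshold check, assuming a simple highest-minus-lowest comparison; the names are illustrative, not ProxLB's actual code:

```python
# Hypothetical sketch of the balanciness check described above; the real
# logic lives in proxlb.py and all names here are illustrative only.
def rebalancing_needed(node_usage_percent, balanciness):
    """Return True when the usage gap between the busiest and the most
    idle node exceeds the balanciness threshold (in percentage points)."""
    highest = max(node_usage_percent.values())
    lowest = min(node_usage_percent.values())
    return (highest - lowest) > balanciness

# With the values from the log in this issue (88% vs. 46%) and an assumed
# balanciness of 10, the gap of 42 points flags rebalancing as needed.
print(rebalancing_needed({"node01": 88.0, "node02": 46.0}, 10))  # True
```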
Hello,
No, it's not based on percentages when rebalancing; it's based on the currently free memory on the node. The node with the most free memory (in size, not as a percentage from the node's local view) will be used. I guess for such setups a config parameter to balance by the nodes' percentage values might make more sense for everyone's needs. Would that fit your needs?
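A small sketch of the difference between the two strategies, with made-up numbers chosen so that the two choices diverge (illustrative only, not ProxLB's code):

```python
# Two candidate target nodes; numbers picked so the strategies disagree.
nodes = {
    # name: (free_bytes, total_bytes)
    "node01": (30 * 1024**3, 100 * 1024**3),  # 30 GB free of 100 GB (30% free)
    "node02": (20 * 1024**3, 24 * 1024**3),   # 20 GB free of 24 GB (~83% free)
}

# Current behaviour: the node with the most free memory in bytes wins.
by_bytes = max(nodes, key=lambda n: nodes[n][0])
# Proposed option: the node with the most free memory in percent wins.
by_percent = max(nodes, key=lambda n: nodes[n][0] / nodes[n][1])

print(by_bytes)    # node01
print(by_percent)  # node02
```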
I think it would solve that issue, yes.
Hey @ewenlau, maybe you can give PR #32 a try. It's currently just a dirty hack, since I just want to know if this is what you want. Please change https://github.com/gyptazy/ProxLB/blob/main/proxlb.conf#L8 (mode) to

If this is what you want, I will create a new dedicated key for this in the options (

Hope it helps.

Cheers,
Hello,
Hey @ewenlau, the related and important change is in https://github.com/gyptazy/ProxLB/pull/32/files#diff-4d47e7584181ff92b3c3f57588b89e4fb11158ac22f3d50066588c07267e5a86R580-R581, where it now obtains the free percent value instead of the free bytes value if that option is set. These metrics are obtained from the node_statistics dictionary. In my test it looks like:
Maybe I misunderstood your request?
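For reference, a rough sketch of what the node_statistics entries could look like with that change; the key names here are assumptions for illustration, the real ones are in the PR diff linked above:

```python
# Hypothetical shape of the node_statistics dictionary (key names assumed).
node_statistics = {
    "node01": {
        "memory_total": 100 * 1024**3,
        "memory_free": 90 * 1024**3,
        "memory_free_percent": 90.0,  # new: free memory as a share of total
    },
    "node02": {
        "memory_total": 24 * 1024**3,
        "memory_free": 12 * 1024**3,
        "memory_free_percent": 50.0,
    },
}

# With the percent option set, the balancer would compare the percent
# values instead of the raw byte counts when picking a target node.
target = max(node_statistics, key=lambda n: node_statistics[n]["memory_free_percent"])
print(target)  # node01
```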
Hello,

Here's the log:

There are also these lines, which I think shouldn't show up? Correct me if I'm wrong.

Thanks a lot for your work, I truly think this project is great. Also, sorry for the late replies; I'm on vacation at the moment and haven't got a lot of free time.
Hey @ewenlau, thanks for your reply. Happy to hear that this finally works for you, and also thanks for replying to the other points.
These three things are my primary goals for the initial 1.0.0 release, before integrating new features. I hope I can finalize everything in the upcoming week. Thank you!
Hey @ewenlau, with PR #32 I introduced a new option to rebalance by the node's free resources in percent instead of bytes. The operation mode for this can be changed by the newly introduced option

A user can define this by setting

That PR also adds a function to validate whether there are objects of type VM or CT to rebalance, to avoid raising a stack trace when no objects are present in a cluster (e.g. a freshly installed cluster). If you could give this another try before I merge, that would be great (it also fixes the log output now).

Your other request was fixed with PR #33.

Thanks,
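A hedged sketch of what such a guard could look like; the function and variable names are hypothetical, the actual implementation is in PR #32:

```python
import logging

def has_balancing_objects(vm_statistics):
    """Return False and log instead of raising a stack trace when a
    cluster (e.g. a freshly installed one) has no VMs or CTs to rebalance."""
    if not vm_statistics:
        logging.info("No VMs or CTs found in the cluster; skipping rebalancing.")
        return False
    return True

# Intended to run before any rebalancing calculations on collected guests.
if not has_balancing_objects({}):
    pass  # nothing to do on an empty cluster
```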
So I've got two nodes, pve01 and pve02. One has 100 GB of RAM, the other 24 GB. At the time of writing, both have about 10 GB of used RAM, but pve01 has only about 10% of its RAM used while pve02 is closer to 50%. ProxLB does indeed conclude that the memory usage is not equal between both nodes, but it reports wildly different results for RAM usage, which makes me think it uses the free memory instead of the available memory. It shouldn't do that in my opinion, but that's beyond the scope of this issue:
<6> ProxLB: Info: [balanciness-validation]: Rebalancing for memory is needed. Highest usage: 88% | Lowest usage: 46%.
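To illustrate the free-versus-available suspicion, here is a back-of-the-envelope sketch with made-up numbers (not read from the actual nodes):

```python
# pve01: 100 GB total, ~10 GB allocated to guests, plus an assumed large
# page cache, which 'free' memory accounting does not count as free.
total = 100 * 1024**3
used_by_guests = 10 * 1024**3
page_cache = 78 * 1024**3  # hypothetical cache/buffers on the node

usage_from_free = (used_by_guests + page_cache) / total * 100
usage_from_available = used_by_guests / total * 100

# If usage is derived from the node's raw free memory (which excludes
# cache), pve01 can appear ~88% used even though only ~10% is allocated.
print(f"{usage_from_free:.0f}% vs {usage_from_available:.0f}%")  # 88% vs 10%
```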
However, it then decides, for some reason, that no rebalancing is needed:
<6> ProxLB: Info: [rebalancing-executor]: No rebalancing needed.
I don't understand why it decides that nothing should be moved. I have a very clear majority of LXCs in my setup (and a few rare VMs), I used the latest proxlb file from the main branch, and I set it to all. There are also these two lines in the logs, which might indicate that it's running in dry-run mode (which it isn't):
<6> ProxLB: Info: [dry-run-output-generator]: Starting dry-run to rebalance vms to their new nodes.
<6> ProxLB: Info: [dry-run-output-generator]: No rebalancing needed.
Why is it doing that? I looked at the code but didn't find any obvious issues. I did not check the rebalancing algorithm, though, so the problem might be in there.
I've linked the log and config files below.
proxlb.log
proxlb.conf.txt