Hotswap allow different alpha scalings and ranks #2177
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
not stale
Smaller to larger rank works, larger to smaller still fails.
Differing alphas are no longer cause to raise an error.
@sayakpaul I had some time to brush up this PR. How do we proceed? As you can see, I have added diffusers tests, but since huggingface/diffusers#9453 is not merged, it won't work yet (i.e. there are recompiles). That PR is awaiting this PR, so there is a mutual dependency. Also, for now ...
That should work for now. I would add a note about this somewhere.
To make progress and help unblock this PR, how about we merge this PR with sufficient testing (which I believe we already have) and mark the ...
LGTM with some minor comments
Your comments should be addressed, please check again. I split the update into 2 commits: 1 for the refactor and 1 for the rest.
The changes look good. While inspecting _get_padded_* I found potentially missing bias handling during padding. Can you take a look?
I will take a look at the PR too. Thanks, Ben!
Quoting githubnemo's inline review comments:
In src/peft/utils/hotswap.py:
+ if original_rank == target_rank:
+     return lora_module
+
+ if original_rank > target_rank:
+     raise ValueError(
+         f"Trying to pad the adapter to the target rank {target_rank}, but the original rank is larger "
+         f"({original_rank}). This is not possible."
+     )
+
+ out_features, in_features = weight.shape
+
+ if is_lora_A:
+     # LoRA A affects out_features
+     padded = torch.zeros(target_rank, in_features, device=weight.device, dtype=weight.dtype)
+     padded[:original_rank, :] = weight
+     new_layer = torch.nn.Linear(in_features, target_rank, bias=lora_module.bias)
Doesn't the bias need padding in this case as well?
In src/peft/utils/hotswap.py:
+         bias=lora_module.bias,
+         groups=lora_module.groups,
+     )
+ else:
+     # LoRA B affects in_channels
+     padded = torch.zeros(out_channels, target_rank, kh, kw, device=weight.device, dtype=weight.dtype)
+     padded[:, :in_channels, :, :] = weight
+     new_layer = torch.nn.Conv2d(
+         target_rank,
+         out_channels,
+         kernel_size=lora_module.kernel_size,
+         stride=lora_module.stride,
+         padding=lora_module.padding,
+         bias=lora_module.bias,
+         groups=lora_module.groups,
+     )
Possibly missing bias padding. See above.
Thanks @githubnemo for catching the potential issue with the bias. I have addressed this, as well as an earlier comment that I missed last time.
Thanks a lot, Benjamin! I have left some comments. LMK if they make sense.
# load lora 0
model = PeftModel.from_pretrained(model, <path-adapter-0>)
# Prepare the model to allow hotswapping even if ranks/scalings of 2nd adapter differ.
# You can skip this step if all ranks and scalings are identical.
prepare_model_for_compiled_hotswap(model, target_rank=max_rank)
Would it make sense to add a util like infer_max_rank_from_state_dict()? Because it can be difficult for users to know max_rank from the get-go. WDYT?
I was thinking about how an application would use this in practice for the reason you mentioned. If we break it down like this:
- Some business use case of serving different LoRAs: They can probably determine the max rank ahead of time and set it correctly.
- An image generation app like Comfy: This is hard as they cannot know in advance what LoRAs the user plans on loading. Theoretically, all available LoRAs could be scanned (which might be expensive) and the largest rank be chosen (which might be total overkill and damage runtime performance).
If we had an infer_max_rank_from_state_dict(), it would be for the 2nd case, which has the mentioned issues, but correct me if I'm wrong. But regardless of that, where would the state dicts come from? Since we have multiple LoRA formats in the diffusion ecosystem and diffusers knows how to deal with them, would that function not be better added to diffusers?
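For illustration, such a helper could be as simple as scanning the lora_A weight shapes in a PEFT-format state dict. This is a hypothetical sketch, not part of this PR; the function name and the assumption of PEFT-style keys are mine:

import torch

# Hypothetical helper: in PEFT-format state dicts, lora_A weights have shape (rank, in_features),
# so the maximum rank is the largest first dimension among them.
def infer_max_rank_from_state_dict(state_dict: dict) -> int:
    ranks = [
        tensor.shape[0]
        for key, tensor in state_dict.items()
        if "lora_A" in key and key.endswith("weight")
    ]
    if not ranks:
        raise ValueError("No lora_A weights found in the state dict")
    return max(ranks)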
Fair enough.
- The adapters must be compatible (e.g. same LoRA alpha, same target modules).
- If you use `torch.compile` and want to avoid recompilation, the LoRA rank must be the same.
- It only works for the same PEFT method, so no swapping LoRA and LoHa, for example.
- The adapters must be compatible (e.g. same target modules).
If the first adapter's target_modules are, for example, a subset of the second adapter's, would it still be a problem? This limitation seems quite critical to me as a user, though.
We have 2 scenarios when the target modules are not identical:
- The 2nd adapter targets a subset of the 1st adapter: We can theoretically swap in the 2nd adapter, since there will be a LoRA layer in place where it is needed. However, the results will be wrong because we will also have unneeded LoRA layers (e.g. if LoRA 1 targets foo and bar and LoRA 2 only targets bar, then foo is still present). Therefore, we need to ensure that the missing weights are zeroed out (see the sketch below). I added logic and tests for this.
- The 2nd adapter targets a superset of layers, or the layers are partly disjoint: In this case, we would need to create new LoRA layers. This logic is not implemented (yet). But even if added, it would require recompilation and is thus not very useful.
This limits what the user can do with this feature, but I don't have a good proposal for how to solve it, except for asking users to swap in the LoRA that targets the most layers first.
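Not the actual implementation in this PR, just a minimal sketch of the zeroing-out idea; the helper name and the incoming_target_names argument are made up for illustration:

import torch

# Zero out lora_B for LoRA layers that the incoming adapter does not cover, so that
# B @ A contributes nothing for those layers. `incoming_target_names` is a hypothetical
# set of module names present in the incoming adapter's state dict.
def zero_unused_lora_layers(model: torch.nn.Module, incoming_target_names: set) -> None:
    for name, module in model.named_modules():
        if not hasattr(module, "lora_B") or name in incoming_target_names:
            continue
        for lora_B_linear in module.lora_B.values():
            lora_B_linear.weight.data.zero_()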
scaling = module.scaling
for key, val in scaling.items():
    if isinstance(val, float):
        scaling[key] = torch.tensor(val, device=module.weight.device)
Do we have to consider the dtype, too?
I did some quick testing of matrix x scalar multiplication and it seems like torch always coerces to the dtype of the matrix, so we should be fine.
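A minimal check of that coercion behavior (just an illustrative snippet, not from the PR):

import torch

# Multiplying a float16 matrix by a float32 scalar tensor keeps the matrix dtype.
mat = torch.ones(2, 2, dtype=torch.float16)
scale = torch.tensor(2.0)  # 0-dim tensor, defaults to float32
print((mat * scale).dtype)  # torch.float16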
I think having that explicitly mentioned might just be more readable for maintainers. But no strong opinions.
Added.
if is_lora_A:
    # LoRA A affects out_features
    padded = torch.zeros(target_rank, in_features, device=weight.device, dtype=weight.dtype)
    padded[:original_rank, :] = weight
    new_layer = torch.nn.Linear(in_features, target_rank, bias=lora_module.bias is not None)
else:
    # LoRA B affects in_features
    padded = torch.zeros(out_features, target_rank, device=weight.device, dtype=weight.dtype)
    padded[:, :original_rank] = weight
    new_layer = torch.nn.Linear(target_rank, out_features, bias=lora_module.bias is not None)
I would also add a comment noting that we might have to revisit this if the corresponding layers come from a quantization backend.
Not sure if I understand. The LoRA A and B layers are always nn.Linear, no matter if the base layer is quantized or not.
I am thinking about functions like dispatch_bnb_4bit(target: torch.nn.Module, adapter_name: str, **kwargs) in src/peft/tuners/lora/bnb.py (line 546).
Or do we simply use these for injection into the base model layers and NOT the adapter layers?
(Apologies in advance for not thoroughly reading through the corresponding code).
So if the model has a linear layer that is targeted with LoRA, we check the type of linear layer and replace it with a normal LoRA Linear, with a bnb Linear4bit, etc. These PEFT layers have lora_A and lora_B sub-modules, but those are always nn.Linear.
In this padding function, we pad the lora_A and lora_B layers, thus we know that they're always nn.Linear, regardless of what the type of the original targeted layer was.
I added small comments for that.
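A quick way to see this (an illustrative snippet on a toy model; the target_modules value and the module indexing are my assumptions, not code from this PR):

import torch
from peft import LoraConfig, get_peft_model

# Wrap a plain Linear with LoRA and inspect the sub-modules of the resulting LoRA layer.
base = torch.nn.Sequential(torch.nn.Linear(16, 16))
peft_model = get_peft_model(base, LoraConfig(target_modules=["0"], r=8))
lora_layer = peft_model.base_model.model[0]
print(type(lora_layer.lora_A["default"]))  # <class 'torch.nn.modules.linear.Linear'>
print(type(lora_layer.lora_B["default"]))  # <class 'torch.nn.modules.linear.Linear'>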
if lora_module.bias is not None:
    new_layer.bias.data = lora_module.bias.data
Could potentially also leverage the copy_() method?
Hmm, I tried but got:
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function linear>(*(FakeTensor(..., device='cuda:0', size=(10, 1, 16)), Parameter(FakeTensor(..., size=(13, 16), requires_grad=True)), None), **{}):
Unhandled FakeTensor Device Propagation for aten.mm.default, found two different devices cuda:0, cpu
Not sure why that is, since we take care of the device, but I reverted the change. I think in this specific situation, there is no practical benefit to .copy_ anyway, is there?
Fair enough. Thanks for looking into it.
is_compiled = hasattr(model, "_orig_mod")
if is_compiled:
    raise ValueError("Call prepare_model_for_compiled_hotswap *before* compiling the model")
Well, _orig_mod is present when we do torch.compile(), but compilation can also be triggered with model.compile().
Not sure if we can do anything about that. Any suggestion?
We could check with torch.compiler.is_compiling().
You mean:
- if is_compiled:
+ if is_compiled or torch.compiler.is_compiling():
IIUC, this would not help. We can call this function inside, say, a forward method of an nn.Module to check if the module is being compiled. However, prepare_model_for_compiled_hotswap is not called inside the module. The result of torch.compiler.is_compiling() would thus always return False (except if someone called it inside forward and then compiles that module, but that would be very incorrect).
Okay. Then maybe having a comment to mention the .compile() use case and that we don't know how to deal with that yet would be nice?
Okay, I did find a way to check for model.compile(), namely the _compiled_call_impl attribute: https://github.com/pytorch/pytorch/blob/1eba9b3aa3c43f86f4a2c807ac8e12c4a7767340/torch/nn/modules/module.py#L2991
It is private, so I used getattr to ensure this doesn't break easily. I also tested this and it seems to work, although I'm not sure when users should use model.compile instead of torch.compile(model).
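A minimal sketch of the combined check described here (it relies on private PyTorch attributes, so treat it as illustrative rather than stable API; the helper name is made up):

import torch

def is_compiled_model(model: torch.nn.Module) -> bool:
    # torch.compile(model) returns a wrapper that exposes the original module as _orig_mod.
    if hasattr(model, "_orig_mod"):
        return True
    # model.compile() stores the compiled forward on the private _compiled_call_impl attribute.
    if getattr(model, "_compiled_call_impl", None) is not None:
        return True
    return False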
# config can be either a PeftConfig, or a dict of PeftConfigs like PeftModel.peft_config
config = {"dummy": config}
This config won't have any repercussions because it's never assigned to a PeftModel, right?
The {"dummy": config}
is not assigned, but the passed in config
argument is still mutated, so the intended side effect is triggered.
As to why I wrap this into {"dummy": config}
, it's just to have the same code below for configs in dicts.
# TODO conv2d
raise NotImplementedError
if old_val.shape[0] > new_val.shape[0]:
    old_val.data.fill_(0)
(nit): Would it make sense to have 0 assigned as a PAD_VALUE var and use it?
We would never change that value, right? It must be 0.
tests/test_gpu_examples.py (outdated)
@@ -4172,18 +4173,19 @@ def test_hotswapping_compiled_model_does_not_trigger_recompilation(self):
assert "__recompiles" in stderr.decode()

@pytest.mark.xfail(strict=True, reason="Requires hotswap to be implemented in diffusers")
def test_hotswapping_compiled_diffusion_model_does_not_trigger_recompilation(self):
@pytest.mark.parametrize("ranks", ["7,13", "13,7"])  # the ranks of the 2 LoRAs as str
Should we also check for different alphas?
In run_compiled_model_hotswap.py, we set:
config0 = LoraConfig(init_lora_weights=False, r=rank0, lora_alpha=alpha0, target_modules=["q_proj", "v_proj"])
config1 = LoraConfig(init_lora_weights=False, r=rank1, lora_alpha=alpha1, target_modules=["q_proj"])
so the alphas are indeed different.
elif "lora_A" in name and name.endswith(".bias"): | ||
assert False, "LoRA A should not have a bias term" | ||
elif "lora_B" in name and name.endswith(".bias"): | ||
assert param.shape[0] == 10 # output shape of conv layer |
Should we also add a test for checking if the correct config was updated?
We have test_prepare_model_for_compiled_hotswap_scalings_update_config above, is there something more that should be tested with regard to configs?
Perhaps also for alphas? IIUC it only checks for ranks.
Hmm, but alpha values are not changed, only converted to tensors. So there is nothing to update in the config, is there?
- Allow 2nd adapter to target subset of 1st
- Update tests to reflect this
- Improve docs
Thanks for the feedback, Sayak. LMK if I addressed all your points.
elif "lora_A" in name and name.endswith(".bias"): | ||
assert False, "LoRA A should not have a bias term" | ||
elif "lora_B" in name and name.endswith(".bias"): | ||
assert param.shape[0] == 10 # output shape of conv layer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have test_prepare_model_for_compiled_hotswap_scalings_update_config
above, is there something more that should be tested with regard to configs?
Thanks, Benjamin! Just a few last comments. But nothing merge-blocking.
@sayakpaul I replied to your comments, please check again.
elif "lora_A" in name and name.endswith(".bias"): | ||
assert False, "LoRA A should not have a bias term" | ||
elif "lora_B" in name and name.endswith(".bias"): | ||
assert param.shape[0] == 10 # output shape of conv layer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, but alpha
values are not changed, only converted to tensors. So there is nothing to update in the config, is there?
Just a single comment. Thanks for all the hard work and iterations!
The recompilation tests so far called a subprocess and then checked the torch logs. This is cumbersome. Now, the tests have been simplified by setting the torch._dynamo.config.error_on_recompile flag to True. This way, we can check for the presence/absence of this error instead of having to check the logs. As a consequence, the standalone scripts could be removed and all the code now resides within the test class.

Unfortunately, a side effect of this change is that now all tests run in the same process. This is problematic because torch.compile caches compile artifacts in the same process, which leads to errors when we use the same compiled model across multiple tests. I could verify that this is the case by running the tests individually, which prevented the compilation error. I tried a few things to mitigate this, but none helped:

- Setting torch._inductor.config.force_disable_caches = True
- Setting TORCHINDUCTOR_FORCE_DISABLE_CACHES=1
- Setting a separate TORCHINDUCTOR_CACHE_DIR for each test

Thus I resorted to using a different model for each test. There is a function to check for this and raise an error if the same model is used twice.
I'm OK with having different models for different scenarios to work around the dynamo cache but there seems to be infrastructure in place that could be used to work around this.
I think that the recompilation detection itself utilizes the caching so disabling caching will not work either way.
tests/test_gpu_examples.py (outdated)
def raise_error_on_recompile(self):
    """Raise an error when torch recompiles in the context.

    Raises a torch._dynamo.exc.RecompileError error.
    """
    prev_value = torch._dynamo.config.error_on_recompile
    torch._dynamo.config.error_on_recompile = True
    try:
        yield
    finally:
        torch._dynamo.config.error_on_recompile = prev_value
I think you can replace this with torch._dynamo.config.patch(error_on_recompile=True) if you want. See https://github.com/pytorch/pytorch/blob/83c0763dda1f93c6cf552ba88260a0dc7a3ecb70/torch/utils/_config_module.py#L226
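For illustration, the suggestion would look roughly like this (a sketch, not the PR's test code; the toy function is mine):

import torch
import torch._dynamo

@torch.compile
def double(x):
    return x * 2

double(torch.ones(2))  # first call compiles

# config.patch works as a context manager (and decorator) and restores the previous value on exit.
with torch._dynamo.config.patch(error_on_recompile=True):
    double(torch.ones(2))  # same input signature: no recompile
    try:
        double(torch.ones(2, dtype=torch.float64))  # dtype change forces a recompile
    except torch._dynamo.exc.RecompileError:
        print("recompile detected")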
tests/test_gpu_examples.py (outdated)
# LLM #
#######

def check_hotswap(self, do_hotswap, model_id, ranks, alpha_scalings):
Maybe introducing a context manager that calls torch._dynamo.reset() to reset the caches would work around the need for different models?
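A rough sketch of such a helper (illustrative only; the name is made up):

import contextlib

import torch
import torch._dynamo

# Reset dynamo/compile caches before and after the wrapped block, so the same model
# can be compiled fresh in each test instead of requiring a different model per test.
@contextlib.contextmanager
def fresh_dynamo_caches():
    torch._dynamo.reset()
    try:
        yield
    finally:
        torch._dynamo.reset()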
@githubnemo Very nice finds, both of your suggestions work and are much better than the workarounds I came up with.
... instead of at the start of the test.
See huggingface/diffusers#9453 for context.
Hotswapping of LoRA adapters is already implemented, but when alpha scalings or ranks differ, this triggers recompilation if the model is compiled, which is inefficient. Users can now call prepare_model_for_compiled_hotswap to prevent recompilation in many cases (see the doc update for caveats).
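Putting the pieces together, the intended flow looks roughly like this (a sketch based on the snippet shown earlier; base_model, the adapter paths, inputs, and max_rank are placeholders, and the import paths are my assumption):

import torch
from peft import PeftModel
from peft.utils.hotswap import hotswap_adapter, prepare_model_for_compiled_hotswap

# load lora 0
model = PeftModel.from_pretrained(base_model, "<path-adapter-0>")
# pad all LoRA layers to max_rank and turn scalings into tensors so later swaps don't recompile
prepare_model_for_compiled_hotswap(model, target_rank=max_rank)
model = torch.compile(model)
out0 = model(inputs)
# swap in lora 1 in-place, without re-wrapping or recompiling the model
hotswap_adapter(model, "<path-adapter-1>", adapter_name="default")
out1 = model(inputs)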