Prior model preservation #505
base: master
Conversation
@dxqbYD can you add examples? Your examples are great. Even though I couldn't make it work, maybe after it's properly implemented it will work :D So, examples of comparisons and how you set up your concepts. |
Samples can be found in the release notes of SimpleTuner: https://www.reddit.com/r/StableDiffusion/comments/1g2i13s/simpletuner_v112_now_with_masked_loss_training/ |
kohya implementation: kohya-ss/sd-scripts#1710 |
This sounds like a really good idea to add as an option. But it definitely needs a more generic implementation. There are two issues to solve. Dataset: how do we select the regularization samples during training? This also needs to work with a batch size higher than 1. Ideally it would mix regularization samples and normal training samples within the same batch. Unhooking the LoRA: each model has different sub-modules, so we need a generic method of disabling the LoRA for the prior result. A function in the model class to enable/disable all LoRAs could work well. |
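A minimal sketch of what such a generic enable/disable hook could look like, assuming a hypothetical model wrapper that keeps references to its LoRA wrapper modules; `BaseModel`, `lora_modules`, and the `enabled` flag are illustrative names, not OneTrainer's actual API:

```python
import torch
from contextlib import contextmanager


class BaseModel:
    def __init__(self):
        # each LoRA wrapper is assumed to expose an `enabled` flag that its
        # forward pass checks before adding the LoRA delta to the base output
        self.lora_modules: list[torch.nn.Module] = []

    def set_lora_enabled(self, enabled: bool):
        for module in self.lora_modules:
            module.enabled = enabled

    @contextmanager
    def lora_disabled(self):
        # temporarily run the unmodified base model, e.g. to compute the
        # prior prediction used as a regularization target
        self.set_lora_enabled(False)
        try:
            yield
        finally:
            self.set_lora_enabled(True)
```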
How do you intend on mixing regularisation and training samples in a single batch, @Nerogar? That seems non-trivial; the actual target is changed. |
The only difference between prior preservation and normal training is the prediction target. So what I would do is basically this: for regularization samples, compute the target from the model with the LoRA disabled instead of from the noise, and mix them into normal batches. |
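A hedged sketch of that mixed-batch target swap, assuming each sample carries an `is_reg` flag and the model exposes a `lora_disabled()` context like the one sketched above; this is an illustration, not the kohya or OneTrainer implementation:

```python
import torch
import torch.nn.functional as F


def mixed_batch_loss(model, noisy_latents, timesteps, cond, noise_target, is_reg):
    # prediction with the LoRA active, over the whole mixed batch
    pred = model(noisy_latents, timesteps, cond)

    # the prior (LoRA-free) prediction provides the target for reg samples
    with torch.no_grad(), model.lora_disabled():
        prior_pred = model(noisy_latents, timesteps, cond)

    # per-sample target: prior prediction for reg samples, noise/flow target otherwise
    target = torch.where(is_reg.view(-1, 1, 1, 1), prior_pred, noise_target)
    return F.mse_loss(pred, target)
```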
Yes, unfortunately it just doesn't have the same regularisation effect to do it that way. Having an entire batch pull back toward the base model works. |
What are you basing this on? What Nerogar describes above is what kohya has implemented. So if true, that would mean kohya's implementation doesn't work (as well). |
basing it on numerous tests we've run on a cluster of H100s over the last week |
It isn't obvious that this would work without captions, but it does. You can see samples in the reddit link above. The right-most column is without captions.
Yes, agreed. There are more use cases than captions in favor of having it as a separate concept, for example balancing the regularisation using the number of repeats. In some of my tests, 1:1 was too much. @bghira has also found, using his implementation in SimpleTuner, that even though it works with no external data, it works better against high-quality external data. |
Okay, thanks. Any theory on why that would be? I don't see a theoretical reason for your finding that it works better in a separate batch. |
Could you please provide some evidence of this? I.e., a significant enough number of samples that you aren't falling victim to seed RNG. It's important to get this right. |
If this turns out to be right, I'd recommend implementing a feature in the OT concepts that influences how the batches are built; the first option would be how ST builds batches. This could be a useful feature on its own. For example, if you train 2 concepts, it can be beneficial to have 1 image of each concept in a batch instead of the same concept twice, especially if the images within a concept are very similar. |
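An illustrative sketch of that batch-composition idea, drawing one sample from each concept in turn so a batch mixes concepts rather than repeating one; the `concepts` mapping and infinite generator are placeholder choices, not an existing OT option:

```python
import itertools
import random


def interleaved_batches(concepts: dict[str, list], batch_size: int):
    # shuffle each concept's samples once, then cycle through them
    per_concept = {name: itertools.cycle(random.sample(samples, len(samples)))
                   for name, samples in concepts.items()}
    concept_cycle = itertools.cycle(per_concept)

    while True:
        # round-robin over concepts, so e.g. two concepts and batch_size=2
        # gives one image of each concept per batch
        yield [next(per_concept[next(concept_cycle)]) for _ in range(batch_size)]
```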
I don't have time, sorry; do it whichever way works best for your codebase. |
Nothing usable for OneTrainer users yet. I should mention that there was apparently a paper published in April of this year proposing this technique; I just didn't know about it: https://arxiv.org/pdf/2404.07554 |
Honestly, I would love to have this method implemented with higher batch sizes. It seems to be the best preservation technique so far, coupled with some weight decay. |
I don't plan to finish this PR for now, because the teacher-model-student-model thing employed here is much more powerful than just simple prior knowledge preservation, and I want to explore this further. If anyone wants to finish it in the meantime, in kohya's code you can find what is necessary to apply it to mixed batches. |
It would be nice if someone could finish it and add batch size support to it. I don't think kohya's one works well enough and I can't seem to get good results; yours works pretty well. |
Here are some early samples of what I was talking about. The first sample of each pair is generated by Flux in its own style, but the LoRA was trained not on training images but through PonyDiffusionXL6 as a trainer model. The second sample of each pair is what the character looks like when generated by PonyDiffusionXL. The drawing style of Pony was intentionally not trained here. |
Any update on this getting pushed to the main branch? |
This PR won't get pushed to the main branch. As for the general plan, there are lots of interesting training techniques coming, if and when I ever bring them to a stable state. |
This code can be used to preserve the prior model on prompts other than the trained captions. After several more tests, I think this is worth implementing as a quite generic feature.
Let me know if I should provide more details here, which you can currently find on the OT discord.
There is a feature request for SimpleTuner here: bghira/SimpleTuner#1031
This is a draft PR only to gauge interest in a full PR. It only works with batch size one, only for Flux, only for LoRA, and only for the transformer.
It could be implemented generically for all LoRA types. With major effort, it could also be implemented for full finetuning, but to avoid having the full model in VRAM twice, pre-generating the regularization-step predictions would be necessary.
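A rough sketch of that pre-generation idea for full finetuning, assuming the regularization inputs can be fixed ahead of time; the function name and caching scheme are illustrative, not part of this draft PR:

```python
import torch


@torch.no_grad()
def precompute_prior_targets(base_model, reg_batches, cache_path="prior_targets.pt"):
    # run the untouched base model once over all regularization inputs and
    # store its predictions, so a frozen copy never has to sit in VRAM
    # alongside the model being finetuned
    targets = []
    for noisy_latents, timesteps, cond in reg_batches:
        targets.append(base_model(noisy_latents, timesteps, cond).cpu())
    torch.save(targets, cache_path)
    return targets
```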