Prior model preservation #505
base: master
Conversation
@dxqbYD can you add examples? Your examples are great. Even though I couldn't make it work, maybe after it's properly implemented it will work :D So, examples of comparisons and how you set up your concepts. |
Samples can be found in the release notes of SimpleTuner: https://www.reddit.com/r/StableDiffusion/comments/1g2i13s/simpletuner_v112_now_with_masked_loss_training/ |
kohya implementation: kohya-ss/sd-scripts#1710 |
This sounds like a really good idea to add as an option. But it definitely needs a more generic implementation. There are two issues to solve. Dataset: how do we select the regularization samples during training? This also needs to work with a batch size higher than 1. Ideally it would mix regularization samples and normal training samples within the same batch. Unhooking the LoRA: each model has different sub-modules, so we need a generic method of disabling the LoRA for the prior result. A function in the model class to enable/disable all LoRAs could work well. |
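A minimal sketch of what such a generic enable/disable hook could look like, assuming a hypothetical model wrapper that keeps references to its LoRA wrapper modules; `BaseModel`, `lora_modules`, and the `enabled` flag are illustrative names, not OneTrainer's actual API:

```python
import torch
from contextlib import contextmanager


class BaseModel:
    def __init__(self):
        # each LoRA wrapper is assumed to expose an `enabled` flag that its
        # forward pass checks before adding the LoRA delta to the base output
        self.lora_modules: list[torch.nn.Module] = []

    def set_lora_enabled(self, enabled: bool):
        for module in self.lora_modules:
            module.enabled = enabled

    @contextmanager
    def lora_disabled(self):
        # temporarily run the unmodified base model, e.g. to compute the
        # prior prediction used as a regularization target
        self.set_lora_enabled(False)
        try:
            yield
        finally:
            self.set_lora_enabled(True)
```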
How do you intend on mixing regularisation and training samples in a single batch, @Nerogar? That seems non-trivial; the actual target is changed. |
The only difference between prior preservation and normal training is the prediction target. So what I would do is basically this: for regularization samples, compute the target from the model with the LoRA disabled instead of from the noise, and mix them into normal batches. |
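A hedged sketch of that mixed-batch target swap, assuming each sample carries an `is_reg` flag and the model exposes a `lora_disabled()` context like the one sketched above; this is an illustration, not the kohya or OneTrainer implementation:

```python
import torch
import torch.nn.functional as F


def mixed_batch_loss(model, noisy_latents, timesteps, cond, noise_target, is_reg):
    # prediction with the LoRA active, over the whole mixed batch
    pred = model(noisy_latents, timesteps, cond)

    # the prior (LoRA-free) prediction provides the target for reg samples
    with torch.no_grad(), model.lora_disabled():
        prior_pred = model(noisy_latents, timesteps, cond)

    # per-sample target: prior prediction for reg samples, noise/flow target otherwise
    target = torch.where(is_reg.view(-1, 1, 1, 1), prior_pred, noise_target)
    return F.mse_loss(pred, target)
```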
Yes, unfortunately it just doesn't have the same regularisation effect to do it that way. Having an entire batch pull back toward the base model works. |
What are you basing this on? What Nerogar describes above is what kohya has implemented. So if true, that would mean kohya's implementation doesn't work (as well). |
basing it on numerous tests we've run on a cluster of H100s over the last week |
It isn't obvious that this would work without captions, but it does. You can see samples in the reddit link above. The right-most column is without captions.
Yes, agreed. There are more use cases than captions in favor of having it as a separate concept, for example balancing the regularisation using the number of repeats. In some of my tests, 1:1 was too much. @bghira has also found, using his implementation in SimpleTuner, that even though it works with no external data, it works better against high-quality external data. |
Okay, thanks. Any theory on why that would be? I don't see a theoretical reason for your finding that it works better in a separate batch. |
Could you please provide some evidence of this? I.e., a significant enough number of samples that you aren't falling victim to seed RNG. It's important to get this right. |
If this turns out to be right, I'd recommend implementing a feature in the OT concepts that influences how the batches are built; the first option would be how ST builds batches. This could be a useful feature on its own. For example, if you train 2 concepts, it can be beneficial to have 1 image of each concept in a batch instead of the same concept twice, especially if the images within a concept are very similar. |
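An illustrative sketch of that batch-composition idea, drawing one sample from each concept in turn so a batch mixes concepts rather than repeating one; the `concepts` mapping and infinite generator are placeholder choices, not an existing OT option:

```python
import itertools
import random


def interleaved_batches(concepts: dict[str, list], batch_size: int):
    # shuffle each concept's samples once, then cycle through them
    per_concept = {name: itertools.cycle(random.sample(samples, len(samples)))
                   for name, samples in concepts.items()}
    concept_cycle = itertools.cycle(per_concept)

    while True:
        # round-robin over concepts, so e.g. two concepts and batch_size=2
        # gives one image of each concept per batch
        yield [next(per_concept[next(concept_cycle)]) for _ in range(batch_size)]
```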
I don't have time, sorry; do it whichever way works best for your codebase. |
Nothing usable for OneTrainer users yet. I should mention that there was apparently a paper published in April of this year proposing this technique; I just didn't know about it: https://arxiv.org/pdf/2404.07554 |
Honestly, I would love to have this method implemented with higher batch sizes. It seems to be the best preservation technique so far, coupled with some weight decay. |
I don't plan to finish this PR for now, because the teacher-model-student-model thing employed here is much more powerful than just simple prior knowledge preservation, and I want to explore this further. If anyone wants to finish it in the meantime, in kohya's code you can find what is necessary to apply it to mixed batches. |
It would be nice if someone could finish it and add batch size support to it. I don't think kohya's one works well enough and I can't seem to get good results; yours works pretty well. |
Here are some early samples of what I was talking about. The first sample of each pair is generated by Flux in its own style, but the LoRA was trained not on training images but through PonyDiffusionXL6 as a trainer model. The second sample of each pair is what the character looks like when generated by PonyDiffusionXL. The drawing style of Pony was intentionally not trained here. |
Any update on this getting pushed to the main branch? |
This PR won't get pushed to the main branch. As for the general plan, there are lots of interesting training techniques coming, if and when I ever bring them to a stable state. |
This code can be used to preserve the prior model on prompts other than the trained captions. After several more tests, I think this is worth implementing as a quite generic feature.
Let me know if I should provide more details here, which you can currently find on the OT discord.
There is a feature request for SimpleTuner here: bghira/SimpleTuner#1031
This is a draft PR only to gauge interest in a full PR. It only works with batch size one, only for Flux, only for LoRA, and only for the transformer.
It could be implemented generically for all LoRA types. With major effort, it could also be implemented for full finetuning, but to avoid having the full model in VRAM twice, pre-generating the regularization-step predictions would be necessary.
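A rough sketch of that pre-generation idea for full finetuning, assuming the regularization inputs can be fixed ahead of time; the function name and caching scheme are illustrative, not part of this draft PR:

```python
import torch


@torch.no_grad()
def precompute_prior_targets(base_model, reg_batches, cache_path="prior_targets.pt"):
    # run the untouched base model once over all regularization inputs and
    # store its predictions, so a frozen copy never has to sit in VRAM
    # alongside the model being finetuned
    targets = []
    for noisy_latents, timesteps, cond in reg_batches:
        targets.append(base_model(noisy_latents, timesteps, cond).cpu())
    torch.save(targets, cache_path)
    return targets
```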