Add keep_tokens_separator as alternative for keep_tokens #975
Hi, great job as always.
I'd like to propose this feature, which is inspired by NovelAI's tagging. They train their model with some important tags placed at the head of the caption and the rest of the tags shuffled.
Got this from their docs:
And this is also confirmed by finetunej.
![image](https://private-user-images.githubusercontent.com/50163983/286219961-08ffcabe-6f52-4f22-bc1b-1d7667096934.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMTY0ODgsIm5iZiI6MTczOTMxNjE4OCwicGF0aCI6Ii81MDE2Mzk4My8yODYyMTk5NjEtMDhmZmNhYmUtNmY1Mi00ZjIyLWJjMWItMWQ3NjY3MDk2OTM0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDIzMjMwOFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTFhMTg2YTE5OTk0MzM1ZDE3NjJkYWIxZTE5NWQwZWNlNWVhMjRlNTFhZGM2MWM1ZDZmMGI1YmI2N2JhZTU4NTcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.C0ApbnNQsbEFs8owRjnb7Fl3624r9pwxFiD1mbxH9L8)
We also know that some Danbooru images have more than one tag in `tag_character_string` and `tag_copyright_string`, and some have both `1boy, 1girl` in one picture, so using `keep_tokens` alone is not enough to mimic NovelAI tagging: the number of tags to keep differs from caption to caption.

`keep_tokens_separator` is proposed so that a different number of tokens can be kept from being shuffled in each caption.

For example:
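The snippet below is only an illustrative sketch of the idea, not the code from this PR. It assumes the separator string is `|||` and that tags are comma-separated; the function name `shuffle_with_separator` and its parameters are hypothetical. Everything before the separator stays in place, and everything after it is shuffled.

```python
import random


def shuffle_with_separator(caption, keep_tokens_separator="|||",
                           caption_separator=",", rng=None):
    """Shuffle only the tags that come after keep_tokens_separator.

    Hypothetical helper for illustration; the "|||" default is an assumption.
    """
    rng = rng or random
    if keep_tokens_separator and keep_tokens_separator in caption:
        # Split once: the left side is fixed, the right side is shuffled.
        fixed_part, flex_part = caption.split(keep_tokens_separator, 1)
    else:
        # No separator in this caption: nothing is fixed.
        fixed_part, flex_part = "", caption

    fixed = [t.strip() for t in fixed_part.split(caption_separator) if t.strip()]
    flex = [t.strip() for t in flex_part.split(caption_separator) if t.strip()]
    rng.shuffle(flex)
    return ", ".join(fixed + flex)


# Two captions with a different number of fixed tags (made-up tags).
print(shuffle_with_separator("1girl, character a, series x ||| smile, long hair, outdoors"))
print(shuffle_with_separator("1boy, 1girl, character a, character b, series x ||| smile, indoors"))
```

With a fixed `keep_tokens` count, one of these two captions would either shuffle a tag that should stay at the head or pin a tag that should be shuffled; a per-caption separator avoids that.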
I haven't tested this for fine-tuning, but I have trained some LoRAs with this separator:
link to model | link to datasets (5.65gb)
Thank you!