Extremely long context finetuning #1291

Open
GianlucaDeStefano opened this issue Nov 14, 2024 · 2 comments

GianlucaDeStefano commented Nov 14, 2024

Hi all, I am trying to fine-tune models on extremely long contexts.
I've tested the training setup below, and I managed to fine-tune:

  • Llama 3.1 1B with a max_seq_length of 128 * 1024 tokens
  • Qwen2.5-Coder-1.5B-Instruct-bnb-4bit / Qwen2.5-Coder-0.5B-Instruct-bnb-4bit with a max_seq_length of 64 * 1024 tokens.

I would really like to reach a context length of 128K tokens for Qwen as well; however, I get an OOM error (even with the smallest 0.5B model). Is there something else I can do to optimize training over long contexts?
Furthermore, why do I get no memory error when fine-tuning Llama 3.1 1B, which has double the parameters?

My codebase is:


    # NOTE: MODEL_NAME, load_in_4bit, max_seq_length, DATASET_PATH and collator
    # are defined elsewhere in my codebase.
    import torch
    from datasets import Dataset
    from transformers import TrainingArguments
    from trl import SFTTrainer
    from unsloth import FastLanguageModel, is_bfloat16_supported

    # Load the model
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = MODEL_NAME, # "unsloth/Llama-3.2-1B-Instruct" or "Qwen2.5-Coder-1.5B-Instruct-bnb-4bit"
        dtype = torch.bfloat16,
        load_in_4bit = load_in_4bit,
        max_seq_length = max_seq_length,
    )

    model = FastLanguageModel.get_peft_model(
        model,
        r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",],
        lora_alpha = 16,
        lora_dropout = 0, # Supports any, but = 0 is optimized
        bias = "none",    # Supports any, but = "none" is optimized
        # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
        use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
        random_state = 3407,
        use_rslora = False,  # We support rank stabilized LoRA
        loftq_config = None, # And LoftQ
    )

    # Load the dataset
    dataset = Dataset.load_from_disk(DATASET_PATH)

    # Define the training config
    trainer = SFTTrainer(
        model = model,
        tokenizer = tokenizer,
        train_dataset = dataset['train'],
        eval_dataset = dataset['test'],
        dataset_text_field = "text",
        max_seq_length = tokenizer.model_max_length,
        data_collator = collator,
        dataset_num_proc = 1,
        packing = False, # Can make training 5x faster for short sequences.
        args = TrainingArguments(
            per_device_train_batch_size = 1,
            per_device_eval_batch_size = 1,
            gradient_accumulation_steps = 8,
            eval_accumulation_steps = 1,
            warmup_steps = 5,
            eval_steps = 4,
            num_train_epochs = 1,
            learning_rate = 2e-4,
            fp16 = not is_bfloat16_supported(),
            fp16_full_eval = True,
            bf16 = is_bfloat16_supported(),
            logging_steps = 1,
            optim = "adamw_8bit",
            weight_decay = 0.01,
            lr_scheduler_type = "linear",
            seed = 3407,
            output_dir = "outputs",
            report_to = "none", # Use this for WandB etc
            eval_strategy = "steps",
        ),
    )

    trainer.train()

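As a point of comparison between the two runs, here is a minimal sketch of how peak VRAM could be checked with PyTorch's built-in memory counters (this assumes a single CUDA device and simply wraps the trainer.train() call above):

    import torch

    # Reset the peak-memory counters so they only reflect this training run.
    torch.cuda.reset_peak_memory_stats()

    trainer.train()

    # Peak memory allocated to tensors vs. reserved by the caching allocator.
    peak_alloc_gib = torch.cuda.max_memory_allocated() / 1024**3
    peak_reserved_gib = torch.cuda.max_memory_reserved() / 1024**3
    print(f"Peak allocated: {peak_alloc_gib:.2f} GiB, peak reserved: {peak_reserved_gib:.2f} GiB")

Comparing these numbers between the Llama and Qwen configurations makes it easier to see how close each one is to the OOM limit.
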
danielhanchen (Contributor) commented

@GianlucaDeStefano So you're saying Llama 3.1 8B can reach a 128K context length with Unsloth, but the smaller Qwen 1.5B can only do 64K?

GianlucaDeStefano (Author) commented Nov 15, 2024

Llama 3.1 1B can reach 128K tokens, while Qwen 0.5B can only reach 64K.
I'm aware that this may also depend on the internal settings of the models; I was wondering if there is any particular setting that can help me further optimize training when using super long examples (~128K tokens).
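
For intuition, here is a rough back-of-envelope sketch of how one such model-specific setting, the vocabulary size, drives activation memory at long context. The vocab sizes below are assumptions taken from the public model configs (Llama 1B: 128256, Qwen2.5: 151936) and should be double-checked, and the estimate ignores Unsloth's chunked/fused cross-entropy and gradient checkpointing, so it only illustrates the scaling, not the exact peak usage:

    # Rough size of the bf16 output logits for a single 128K-token sequence.
    # Vocab sizes are assumptions from the public configs; double-check them.
    def logits_gib(seq_len: int, vocab_size: int, bytes_per_elem: int = 2) -> float:
        return seq_len * vocab_size * bytes_per_elem / 1024**3

    for name, vocab_size in [("Llama (vocab 128256)", 128_256), ("Qwen2.5 (vocab 151936)", 151_936)]:
        print(f"{name}: ~{logits_gib(128 * 1024, vocab_size):.0f} GiB of logits at 128K tokens")

In other words, a model with fewer parameters can still need noticeably more activation memory per step if its vocabulary (and therefore its logits tensor) is larger, which may be part of why the 0.5B Qwen model hits OOM before the 1B Llama model does.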

Despite trying various approaches, I’ve been unable to exceed these limits on a single GPU. Scaling up to multiple GPUs would address the issue, but unfortunately, Unsloth does not yet support multi-GPU training. Additionally, switching to other frameworks isn’t an option since handling such large sequence lengths requires tensor parallelism to fit individual layers into GPU memory—a process that is complex to configure.
