-
The learning rate is something that YOU are controlling. In your setup it's essentially zero because you are using the noam scheduler with a very low base learning rate. Either disable noam decay, or increase your base learning rate factor. The provided llama2 finetuning recipe is far from perfect, but you can see there that it does not use noam decay. Alternatively, the "standard" transformer recipes (wmt17, wiki103) use noam decay, but with a higher base learning rate. As I already advised earlier, you might want to start from a working setup and then change things iteratively (e.g. the standard llama2 finetuning recipe, then a "simple" translation task, then your low-resource task).
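As a rough sketch of what those two options could look like (assuming the usual OpenNMT-py/Eole training keys `decay_method`, `learning_rate`, `warmup_steps`; the values and nesting here are illustrative only, check the actual recipe schema before copying):

```yaml
# Option A: disable noam decay and use a small constant base learning rate
# (values are illustrative, not taken from any specific recipe)
training:
  decay_method: none
  learning_rate: 2e-05

# Option B: keep noam decay, but with a much higher base learning rate factor,
# as in the "standard" transformer recipes (wmt17, wiki103)
# training:
#   decay_method: noam
#   learning_rate: 2
#   warmup_steps: 4000
```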
-
Post here the first 5 lines of your /src-tgt-prompts.enty file
-
Taking the second example (the first one is tricky with just a short prompt): for inference, remove the space before user and assistant. Having said that, it may not work properly, since the only Tahitian the model will know comes from your finetuning data, so it is not easy to learn in this context, but give it a try.
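For example, roughly like this (my guess at a ChatML-style prompt as used by TowerInstruct; the instruction wording and the `{source sentence}` placeholder are only illustrative, keep your own prompt text):

```
<|im_start|>user
Translate the following text from English into Tahitian.
English: {source sentence}
Tahitian:<|im_end|>
<|im_start|>assistant
```

i.e. `<|im_start|>user` and `<|im_start|>assistant`, with no space before user/assistant.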
-
also use those:
-
just train about 2000 steps and post the log
-
Hi @francoishernandez, @vince62s,
Iʻve been trying, and trying, and trying to fine-tune TowerInstruct-7B-v0.1 with my bilingual datasets.
My prompt format for fine-tuning and validation files:
My prompt format for inference:
The output from the inference:
My fine-tuning config file:
I follow the Eole NLP TowerInstruct-7B-v0.1 recipe, which I guess would also work for the v0.2 version. During training, the learning rate is flat as a pancake, and the accuracy stays in a flat range and wonʻt improve substantially. Then I merge the LoRA weights with the provided command and run inference.
See training logs here.
Please tell me what Iʻm doing wrong.