I got this error while doing fine-tuning:
File "/mnt/oss-data/xxx/minillm/transformers/src/transformers/generation/utils.py", line 3000, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either
inf
,nan
or element < 0next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either
inf
,nan
or element < 0[I ProcessGroupNCCL.cpp:844] [Rank 2] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 1] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 3] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 2] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 3] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 1] NCCL watchdog thread terminated normally
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12080 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12081 closing signal SIGTERM
...
Has anyone seen the same issue?
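
The error is raised by torch.multinomial when the probability tensor it receives already contains inf/NaN, which in practice usually means the model's logits are non-finite before sampling (for example after a diverged bf16 forward pass). A minimal diagnostic sketch to confirm this; model, tokenizer and the prompt below are placeholders, not part of the MiniLLM code:

import torch

# Placeholder input; in the real script this comes from the eval-gen loop.
input_ids = tokenizer("test prompt", return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    logits = model(input_ids).logits[:, -1, :]            # next-token logits
    print("non-finite logits:", (~torch.isfinite(logits)).sum().item())

    probs = torch.softmax(logits.float(), dim=-1)         # softmax in fp32
    print("non-finite probs:", (~torch.isfinite(probs)).sum().item())

    # The call that fails inside generation/utils.py when probs is non-finite:
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)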
model config:
{
"_name_or_path": "/xxx",
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 13824,
"max_position_embeddings": 4096,
"model_type": "llama",
"num_attention_heads": 40,
"num_hidden_layers": 40,
"num_key_value_heads": 40,
"pad_token_id": 0,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.31.0",
"use_cache": true,
"vocab_size": 55296
}
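
Since the config runs in bfloat16, another quick sanity check is whether the saved checkpoint itself already contains non-finite weights, which would produce exactly this sampling error at eval time. A sketch, assuming the checkpoint loads with the standard transformers API (the path is the placeholder from the config above):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/xxx", torch_dtype=torch.bfloat16)
for name, param in model.named_parameters():
    if not torch.isfinite(param).all():
        print("non-finite values in", name)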
cmd:
export NCCL_DEBUG=""
export WANDB_DISABLED=True
export TF_CPP_MIN_LOG_LEVEL=0
export TORCH_CPP_LOG_LEVEL=0
torchrun --nproc_per_node 4 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 2012 \
    /mnt/oss-data/xxx/minillm/finetune.py \
    --base-path /mnt/oss-data/xxx/minillm \
    --model-path /mnt/oss-data/xxx/minillm/xxx/ --ckpt-name xxx \
    --n-gpu 4 --model-type llama2 --gradient-checkpointing \
    --model-parallel --model-parallel-size 4 \
    --data-dir /mnt/oss-data/xxx/minillm/processed_data/dolly/full/llama2/ \
    --num-workers 0 --dev-num 500 \
    --lr 0.00001 --batch-size 4 --eval-batch-size 8 --gradient-accumulation-steps 2 \
    --warmup-iters 0 --lr-decay-style cosine --weight-decay 1e-2 --clip-grad 1.0 \
    --epochs 10 --max-length 512 --max-prompt-length 256 \
    --do-train --do-valid --eval-gen \
    --save-interval -1 --eval-interval -1 --log-interval 4 --mid-log-num 1 \
    --save /mnt/oss-data/xxx/minillm/results/llama2/train/sft \
    --seed 10 --seed-order 10 \
    --deepspeed --deepspeed_config /mnt/oss-data/xxx/minillm/configs/deepspeed/ds_config_zero2_offload.json \
    --type lm --do-sample --top-k 1 --top-p 0.9 --temperature 1.0
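
For reference, the eval-gen flags above (--do-sample --top-k 1 --top-p 0.9 --temperature 1.0) roughly correspond to the generate() call sketched below; with top_k=1 only a single candidate token survives filtering, so greedy decoding yields the same tokens while skipping the torch.multinomial call that raises this error. This is only a sketch with placeholder model/tokenizer/prompt names, not the script's actual code path:

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Roughly what --do-sample --top-k 1 --top-p 0.9 --temperature 1.0 requests:
out = model.generate(**inputs, do_sample=True, top_k=1, top_p=0.9,
                     temperature=1.0, max_new_tokens=256)

# Equivalent greedy decoding, no multinomial sampling involved:
out = model.generate(**inputs, do_sample=False, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))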