RuntimeError: probability tensor contains either inf, nan or element < 0 #129

dongzhiwen1218 opened this issue Dec 14, 2023 · 0 comments


I got this error while fine-tuning:

File "/mnt/oss-data/xxx/minillm/transformers/src/transformers/generation/utils.py", line 3000, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0
[I ProcessGroupNCCL.cpp:844] [Rank 2] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 1] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 3] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 2] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 3] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 1] NCCL watchdog thread terminated normally
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12080 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12081 closing signal SIGTERM
...
Has anyone seen the same issue?
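
For what it's worth, `torch.multinomial` raises exactly this error whenever the probability tensor it receives contains NaN, Inf, or negative entries, and a single non-finite logit is enough to poison the whole row after softmax. A minimal sketch of the failure mode (the values here are hypothetical, not from the actual model):

```python
import torch

# One NaN logit (e.g. from a diverged loss or a dtype overflow) propagates
# through softmax and invalidates the entire probability row.
logits = torch.tensor([[1.0, float("nan"), 2.0]])
probs = torch.softmax(logits, dim=-1)  # tensor([[nan, nan, nan]])

try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(e)  # probability tensor contains either `inf`, `nan` or element < 0
```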

model config:
```json
{
  "_name_or_path": "/xxx",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 13824,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "num_key_value_heads": 40,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 55296
}
```
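
Given `"torch_dtype": "bfloat16"` in the config above, one thing worth ruling out is the model emitting non-finite logits during generation (a common symptom of a diverged fine-tune or reduced-precision overflow). A hypothetical debugging hook to find the first bad step (the name and call site are my own, not part of MiniLLM or transformers):

```python
import torch

def assert_finite_logits(logits: torch.Tensor, step: int) -> None:
    # Hypothetical debug hook: call on the raw logits at each generation step
    # to catch the first step producing NaN/Inf, before softmax turns them
    # into the multinomial error from the traceback above.
    if not torch.isfinite(logits).all():
        n_nan = int(torch.isnan(logits).sum())
        n_inf = int(torch.isinf(logits).sum())
        raise ValueError(f"non-finite logits at step {step}: {n_nan} NaN, {n_inf} Inf")
```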

cmd:
```bash
export NCCL_DEBUG=""
export WANDB_DISABLED=True
export TF_CPP_MIN_LOG_LEVEL=0
export TORCH_CPP_LOG_LEVEL=0

torchrun --nproc_per_node 4 --nnodes 1 --node_rank 0 \
    --master_addr localhost --master_port 2012 \
    /mnt/oss-data/xxx/minillm/finetune.py \
    --base-path /mnt/oss-data/xxx/minillm \
    --model-path /mnt/oss-data/xxx/minillm/xxx/ \
    --ckpt-name xxx \
    --n-gpu 4 \
    --model-type llama2 \
    --gradient-checkpointing \
    --model-parallel --model-parallel-size 4 \
    --data-dir /mnt/oss-data/xxx/minillm/processed_data/dolly/full/llama2/ \
    --num-workers 0 --dev-num 500 \
    --lr 0.00001 --batch-size 4 --eval-batch-size 8 \
    --gradient-accumulation-steps 2 \
    --warmup-iters 0 --lr-decay-style cosine --weight-decay 1e-2 --clip-grad 1.0 \
    --epochs 10 --max-length 512 --max-prompt-length 256 \
    --do-train --do-valid --eval-gen \
    --save-interval -1 --eval-interval -1 --log-interval 4 --mid-log-num 1 \
    --save /mnt/oss-data/xxx/minillm/results/llama2/train/sft \
    --seed 10 --seed-order 10 \
    --deepspeed \
    --deepspeed_config /mnt/oss-data/xxx/minillm/configs/deepspeed/ds_config_zero2_offload.json \
    --type lm \
    --do-sample --top-k 1 --top-p 0.9 --temperature 1.0
```
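
Two observations on the sampling flags above: with `--top-k 1`, only a single candidate token survives filtering, so sampling is effectively greedy and dropping `--do-sample` would sidestep `torch.multinomial` entirely. If sampling must stay on, a workaround is to sanitize the logits first; note this masks the symptom rather than fixing the underlying divergence. A sketch (`safe_sample` is a hypothetical helper, not MiniLLM code):

```python
import torch

def safe_sample(logits: torch.Tensor) -> torch.Tensor:
    # Hypothetical workaround: replace non-finite logits with -inf so they
    # cannot reach multinomial; a row with no finite entries at all falls
    # back to a uniform distribution instead of yielding NaN probabilities.
    finite = torch.isfinite(logits)
    logits = torch.where(finite, logits, torch.full_like(logits, float("-inf")))
    all_bad = ~finite.any(dim=-1, keepdim=True)
    logits = torch.where(all_bad, torch.zeros_like(logits), logits)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(1)
```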
