Supervised Fine-Tuning of Llama 3 8B on one AWS Trainium instance throws error #761

Open · 1 of 4 tasks
ajayvohra2005 opened this issue on Jan 11, 2025 · 3 comments
Labels: bug (Something isn't working)

ajayvohra2005 commented Jan 11, 2025

System Info

AMI Name: huggingface-neuron-2024-12-13T12-47-53Z-692efe1a-8d5c-4033-bcbc-5d99f2d4ae6a
AMI-ID: ami-0bede50341b2516c4

optimum-cli env

Copy-and-paste the text below in your GitHub issue:

Platform:

  • Platform: Linux-5.15.0-1031-aws-x86_64-with-glibc2.35
  • Python version: 3.10.12

Python packages:

  • optimum-neuron version: 0.0.27
  • neuron-sdk version: 2.20.2
  • optimum version: 1.22.0
  • transformers version: 4.43.2
  • huggingface_hub version: 0.26.5
  • torch version: 2.1.2+cu121
  • aws-neuronx-runtime-discovery version: 2.9
  • libneuronxla version: 2.0.5347.0
  • neuronx-cc version: 2.15.143.0+e39249ad
  • neuronx-distributed version: 0.9.0
  • neuronx-hwm version: NA
  • torch-neuronx version: 2.1.2.2.3.2
  • torch-xla version: 2.1.5
  • transformers-neuronx version: 0.12.313

Neuron Driver:

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

aws-neuronx-collectives/unknown,now 2.22.26.0-17a033bc8 amd64 [installed,upgradable to: 2.23.133.0-3e70920f2]
aws-neuronx-dkms/unknown,now 2.18.12.0 amd64 [installed,upgradable to: 2.19.64.0]
aws-neuronx-oci-hook/unknown,now 2.5.3.0 amd64 [installed,upgradable to: 2.6.36.0]
aws-neuronx-runtime-lib/unknown,now 2.22.14.0-6e27b8d5b amd64 [installed,upgradable to: 2.23.110.0-9b5179492]
aws-neuronx-tools/unknown,now 2.19.0.0 amd64 [installed,upgradable to: 2.20.204.0]

Who can help?

@michaelbenayoun

Running the tutorial Supervised Fine-Tuning of Llama 3 8B on one AWS Trainium instance gives the following error at the compile step:

++ export NEURON_FUSE_SOFTMAX=1
++ NEURON_FUSE_SOFTMAX=1
++ export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
++ NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
++ export MALLOC_ARENA_MAX=64
++ MALLOC_ARENA_MAX=64
++ export 'NEURON_CC_FLAGS=--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
++ NEURON_CC_FLAGS='--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
++ PROCESSES_PER_NODE=8
++ NUM_EPOCHS=1
++ TP_DEGREE=2
++ PP_DEGREE=1
++ BS=1
++ GRADIENT_ACCUMULATION_STEPS=8
++ LOGGING_STEPS=1
++ MODEL_NAME=meta-llama/Meta-Llama-3-8B
++ OUTPUT_DIR=output-
++ '[' '' = 1 ']'
++ MAX_STEPS=-1
++ XLA_USE_BF16=1
++ neuron_parallel_compile torchrun --nproc_per_node 8 docs/source/training_tutorials/sft_lora_finetune_llm.py --model_id meta-llama/Meta-Llama-3-8B --num_train_epochs 1 --do_train --learning_rate 5e-5 --warmup_ratio 0.03 --max_steps -1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --gradient_checkpointing true --bf16 --zero_1 false --tensor_parallel_size 2 --pipeline_parallel_size 1 --logging_steps 1 --save_total_limit 1 --output_dir output- --lr_scheduler_type constant --overwrite_output_dir
2025-01-11 21:51:35.000701:  32302  INFO ||NEURON_PARALLEL_COMPILE||: Running trial run (add option to terminate trial run early; also ignore trial run's generated outputs, i.e. loss, checkpoints)
2025-01-11 21:51:35.000701:  32302  INFO ||NEURON_PARALLEL_COMPILE||: Running cmd: ['torchrun', '--nproc_per_node', '8', 'docs/source/training_tutorials/sft_lora_finetune_llm.py', '--model_id', 'meta-llama/Meta-Llama-3-8B', '--num_train_epochs', '1', '--do_train', '--learning_rate', '5e-5', '--warmup_ratio', '0.03', '--max_steps', '-1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', 'true', '--bf16', '--zero_1', 'false', '--tensor_parallel_size', '2', '--pipeline_parallel_size', '1', '--logging_steps', '1', '--save_total_limit', '1', '--output_dir', 'output-', '--lr_scheduler_type', 'constant', '--overwrite_output_dir']
[2025-01-11 21:51:36,842] torch.distributed.run: [WARNING] 
[2025-01-11 21:51:36,842] torch.distributed.run: [WARNING] *****************************************
[2025-01-11 21:51:36,842] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2025-01-11 21:51:36,842] torch.distributed.run: [WARNING] *****************************************
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
Downloading readme: 100%|██████████| 8.20k/8.20k [00:00<00:00, 48.2MB/s]
Downloading data: 100%|██████████| 13.1M/13.1M [00:00<00:00, 30.4MB/s]
Generating train split: 100%|██████████| 15011/15011 [00:00<00:00, 161040.86 examples/s]
2025-Jan-11 21:51:55.0550 32498:33576 [3] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:51:55.0551 32502:33575 [7] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:51:55.0552 32498:33576 [3] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:51:55.0554 32502:33575 [7] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:51:55.0559 32496:33577 [1] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:51:55.0561 32496:33577 [1] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
2025-Jan-11 21:51:59.0995 32499:33591 [4] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:51:59.0997 32499:33591 [4] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:52:00.0910 32497:33594 [2] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:52:00.0911 32497:33594 [2] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:52:01.0051 32501:33592 [6] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:52:01.0053 32501:33592 [6] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:52:01.0191 32495:33593 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:52:01.0193 32495:33593 [0] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
2025-Jan-11 21:52:08.0236 32500:33608 [5] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:52:08.0238 32500:33608 [5] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.65s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.65s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.65s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.63s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.65s/it]
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.68s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.71s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.69s/it] 
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
(the same traceback is printed by each of the 8 worker processes)
[2025-01-11 21:58:37,190] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 32495) of binary: /opt/aws_neuronx_venv_pytorch_2_1/bin/python3
Traceback (most recent call last):
  File "/opt/aws_neuronx_venv_pytorch_2_1/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
docs/source/training_tutorials/sft_lora_finetune_llm.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 32496)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 32497)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 32498)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 32499)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 32500)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 32501)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 32502)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 32495)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
2025-01-11 21:58:37.000640:  32302  ERROR ||NEURON_PARALLEL_COMPILE||: There was an error in the training script.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Execute this script from the tutorial:

#!/bin/bash

set -ex

export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/"

PROCESSES_PER_NODE=8

NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=1
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID

if [ "$NEURON_EXTRACT_GRAPHS_ONLY" = "1" ]; then
    MAX_STEPS=$((LOGGING_STEPS + 5))
else
    MAX_STEPS=-1
fi


XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE docs/source/training_tutorials/sft_lora_finetune_llm.py \
  --model_id $MODEL_NAME \
  --num_train_epochs $NUM_EPOCHS \
  --do_train \
  --learning_rate 5e-5 \
  --warmup_ratio 0.03 \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $BS \
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
  --gradient_checkpointing true \
  --bf16 \
  --zero_1 false \
  --tensor_parallel_size $TP_DEGREE \
  --pipeline_parallel_size $PP_DEGREE \
  --logging_steps $LOGGING_STEPS \
  --save_total_limit 1 \
  --output_dir $OUTPUT_DIR \
  --lr_scheduler_type "constant" \
  --overwrite_output_dir

Expected behavior

It should compile without error.

ajayvohra2005 added the bug label on Jan 11, 2025
michaelbenayoun self-assigned this on Jan 14, 2025
jimburtoft (Contributor) commented:

pip install trl==0.11.4 will fix the "max_seq_length" error above. It looks like is_trl_available isn't correctly handling the case where the library isn't installed.
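
A minimal sketch of what that likely means in practice, assuming NeuronSFTConfig only gains the trl-specific fields (such as max_seq_length) when a compatible trl can be imported. The class below is a hypothetical stand-in, not the actual optimum-neuron implementation:

# Illustrative only, NOT the optimum-neuron source: if trl is missing (or an
# incompatible version is detected), the SFT config effectively degrades to plain
# training arguments with no max_seq_length field, so the tutorial's keyword is
# rejected with the same TypeError shown in the issue.
from dataclasses import dataclass

@dataclass
class FallbackSFTConfig:
    """Hypothetical stand-in for NeuronSFTConfig without the trl SFTConfig mixin."""
    output_dir: str = "output"

FallbackSFTConfig(output_dir="output-", max_seq_length=1024)
# TypeError: FallbackSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'

With trl==0.11.4 installed, NeuronSFTConfig accepts max_seq_length again, consistent with the workaround above.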

If you run it with PROCESSES_PER_NODE=8, it compiles and runs fine. However, it only runs on 8 of the 32 available cores. If you change it to PROCESSES_PER_NODE=16, you get the error below right after it finds the pre-compiled artifacts. PROCESSES_PER_NODE=32 does the same thing.

I am running this on the latest HF DLAMI on a trn1.32xlarge. I didn't enable EFA, but it is only a single node.

2025-01-16 00:39:17.000434:  31598  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /home/ubuntu/cache_dir_neuron/neuronxcc-2.15.143.0+e39249ad/MODULE_3991015933460767241+6d1be540/model.neff. Exiting with a successfully compiled graph.
2025-Jan-16 00:39:18.3830942025-Jan-16 00:39:18.383099 31597:33785 ERROR   ENC:init_metaring_algorithm                 2025-Jan-16 00:39:18.383096 31595:33778 ERROR   ENC:init_metaring_algorithm                 [nec_dev 8 ch 2] failed to reuse global comm devmem reservation 31601:33751 ERROR   ENC:init_metaring_algorithm                 [nec_dev 6 ch 2] failed to reuse global comm devmem reservation[nec_dev 12 ch 2] failed to reuse global comm devmem reservation

2025-Jan-16 00:39:18.383123
 31599:33753 ERROR   ENC:init_metaring_algorithm                 2025-Jan-16 00:39:18.383133[nec_dev 10 ch 2] failed to reuse global comm devmem reservation 31603:33780 ERROR   ENC:init_metaring_algorithm                 [nec_dev 14 ch 2] failed to reuse global comm devmem reservation

2025-Jan-16 00:39:18.383184 31593:33750 ERROR   ENC:init_metaring_algorithm                 [nec_dev 4 ch 2] failed to reuse global comm devmem reservation
2025-Jan-16 00:39:18.382943 31589:33755 ERROR   ENC:init_metaring_algorithm                 [nec_dev 0 ch 2] failed to reuse global comm devmem reservation
2025-Jan-16 00:39:18.383016 31591:33754 ERROR   ENC:init_metaring_algorithm                 [nec_dev 2 ch 2] failed to reuse global comm devmem reservation
2025-Jan-16 00:39:18.441186 31595:33778 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [0 2 4 6 8 10 12 14 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.441335 31597:33785 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [0 2 4 6 8 10 12 14 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.441507 31599:33753 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [0 2 4 6 8 10 12 14 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.443243 31601:33751 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [0 2 4 6 8 10 12 14 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.443280 31603:33780 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [0 2 4 6 8 10 12 14 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.443311 31589:33755 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [0 2 4 6 8 10 12 14 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.440914 31591:33754 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [0 2 4 6 8 10 12 14 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.441031 31593:33750 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [0 2 4 6 8 10 12 14 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.571065 31596:33811 ERROR   ENC:init_metaring_algorithm                 [nec_dev 7 ch 2] failed to reuse global comm devmem reservation2025-Jan-16 00:39:18.5710772025-Jan-16 00:39:18.571084 31592:33781 ERROR   ENC:init_metaring_algorithm
[nec_dev 3 ch 2] failed to reuse global comm devmem reservation 31598:33752 ERROR   ENC:init_metaring_algorithm                 [nec_dev 9 ch 2] failed to reuse global comm devmem reservation

2025-Jan-16 00:39:18.571097 31600:33786 ERROR   ENC:init_metaring_algorithm                 [nec_dev 11 ch 2] failed to reuse global comm devmem reservation
2025-Jan-16 00:39:18.571107 31602:33782 ERROR   ENC:init_metaring_algorithm                 [nec_dev 13 ch 2] failed to reuse global comm devmem reservation2025-Jan-16 00:39:18.571115
 31604:33749 ERROR   ENC:init_metaring_algorithm                 [nec_dev 15 ch 2] failed to reuse global comm devmem reservation
2025-Jan-16 00:39:18.570950 31590:33784 ERROR   ENC:init_metaring_algorithm                 [nec_dev 1 ch 2] failed to reuse global comm devmem reservation
2025-Jan-16 00:39:18.570981 31594:33783 ERROR   ENC:init_metaring_algorithm                 [nec_dev 5 ch 2] failed to reuse global comm devmem reservation
2025-Jan-16 00:39:18.628881 31598:33752 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [1 3 5 7 9 11 13 15 ]
2025-Jan-16 00:39:18.629024 31600:33786 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [1 3 5 7 9 11 13 15 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.629070 31602:33782 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [1 3 5 7 9 11 13 15 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.629262 31604:33749 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [1 3 5 7 9 11 13 15 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.629449 31590:33784 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [1 3 5 7 9 11 13 15 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.631888 31592:33781 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [1 3 5 7 9 11 13 15 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.628642 31594:33783 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [1 3 5 7 9 11 13 15 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
2025-Jan-16 00:39:18.628733 31596:33811 ERROR   ENC:enc_init_comm                           global comm (2) has less channels than this replica group (4) :likely not enough EFA devices found if running on multiple nodes or CC not permitted on this group [1 3 5 7 9 11 13 15 ]
python3: /opt/workspace/KaenaRuntime/tdrv/encd.c:4147: prep_metaring_topsp_config: Assertion `spe != NULL' failed.
[2025-01-16 00:39:22,324] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 31589 closing signal SIGTERM

jimburtoft (Contributor) commented:

It looks like it is trying to do some kind of collective communication (gather/scatter/reduce), and it is trying to do it across an unsupported group ([0 2 4 6 8 10 12 14] and [1 3 5 7 9 11 13 15]) instead of connected devices.

However, why would it be doing something collective at the very beginning?

Just a guess after staring at it for a while.
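
To make the group layout concrete, here is a purely illustrative computation; the grouping rule (consecutive ranks forming a tensor-parallel pair, one rank per pair joining each data-parallel group) is an assumption inferred from the groups printed in the error log. With 16 workers and tensor_parallel_size=2, the data-parallel replica groups come out as exactly the even and odd rank sets above:

# Illustrative only: derive the replica groups for PROCESSES_PER_NODE=16 with TP_DEGREE=2.
world_size = 16
tp_degree = 2

tensor_parallel_groups = [list(range(start, start + tp_degree))
                          for start in range(0, world_size, tp_degree)]
data_parallel_groups = [list(range(offset, world_size, tp_degree))
                        for offset in range(tp_degree)]

print(tensor_parallel_groups)  # [[0, 1], [2, 3], ..., [14, 15]]
print(data_parallel_groups)    # [[0, 2, 4, 6, 8, 10, 12, 14], [1, 3, 5, 7, 9, 11, 13, 15]]

If that layout is right, the failing collective is the one across each 8-member data-parallel group, which would line up with the "global comm (2) has less channels than this replica group (4)" errors for the groups [0 2 4 6 8 10 12 14] and [1 3 5 7 9 11 13 15].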

jimburtoft (Contributor) commented:

Looks like we initialize the groups through this path; maybe the initialization is causing the CC error. (I purposefully generated this error instead of tracing the code):

    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 52, in training_function
    trainer = NeuronSFTTrainer(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 1764, in __init__
    super().__init__(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 190, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 1525, in __init__
    return Trainer.__init__(self, *args, **kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/transformers/trainer.py", line 409, in __init__
    self.create_accelerator_and_postprocess()
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 290, in create_accelerator_and_postprocess
    self.accelerator = NeuronAccelerator(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/accelerate/accelerator.py", line 153, in __init__
    super().__init__(**full_kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/accelerate/accelerator.py", line 375, in __init__
    self.state = AcceleratorState(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/optimum/neuron/accelerate/state.py", line 204, in __init__
    parallel_state.initialize_model_parallel(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/neuronx_distributed/parallel_layers/parallel_state.py", line 234, in initialize_model_parallel
