Supervised Fine-Tuning of Llama 3 8B on one AWS Trainium instance throws error #761
If you run it with PROCESSES_PER_NODE=8, it compiles and runs fine, but it only runs on 8 of the 32 available cores. If you change it to PROCESSES_PER_NODE=16, you get this error right after it finds the pre-compiled artifacts; PROCESSES_PER_NODE=32 does the same thing. I am running this on the latest HF DLAMI on a trn1.32xlarge. I didn't enable EFA, but it is only a single node.
It looks like it is trying to do some kind of collective communication (gather/scatter/reduce), and it is trying to do it across an unsupported group ([0 2 4 6 8 10 12 14] and [1 3 5 7 9 11 13 15]) instead of across connected devices. Why would it be doing something collective at the very beginning, though? Just a guess after staring at it for a while.
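For what it's worth, groups with that shape typically come from splitting ranks by stride across a parallel dimension. A minimal sketch in plain Python (hypothetical helper, not the actual neuronx-distributed code) reproduces the exact groups from the error when the world size is 16 and the stride is 2:

```python
def strided_groups(world_size, stride):
    # Split ranks into `stride` groups, each group taking every
    # stride-th rank. With world_size=16 and stride=2 this yields
    # the two groups seen in the error message.
    return [list(range(offset, world_size, stride)) for offset in range(stride)]

print(strided_groups(16, 2))
# → [[0, 2, 4, 6, 8, 10, 12, 14], [1, 3, 5, 7, 9, 11, 13, 15]]
```

So the groups look like a rank layout computed for PROCESSES_PER_NODE=16 that the runtime then refuses to build a collective over, which would fit an initialization-time failure.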
Looks like we initialize the groups through this path, and maybe the initialization is what causes the CC error. (I purposely generated this error instead of tracing the code):
System Info
AMI Name: huggingface-neuron-2024-12-13T12-47-53Z-692efe1a-8d5c-4033-bcbc-5d99f2d4ae6a
AMI-ID: ami-0bede50341b2516c4
optimum-cli env
Copy-and-paste the text below in your GitHub issue:
Platform:
Python packages:
optimum-neuron version: 0.0.27
neuron-sdk version: 2.20.2
optimum version: 1.22.0
transformers version: 4.43.2
huggingface_hub version: 0.26.5
torch version: 2.1.2+cu121
aws-neuronx-runtime-discovery version: 2.9
libneuronxla version: 2.0.5347.0
neuronx-cc version: 2.15.143.0+e39249ad
neuronx-distributed version: 0.9.0
neuronx-hwm version: NA
torch-neuronx version: 2.1.2.2.3.2
torch-xla version: 2.1.5
transformers-neuronx version: 0.12.313
Neuron Driver:
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
aws-neuronx-collectives/unknown,now 2.22.26.0-17a033bc8 amd64 [installed,upgradable to: 2.23.133.0-3e70920f2]
aws-neuronx-dkms/unknown,now 2.18.12.0 amd64 [installed,upgradable to: 2.19.64.0]
aws-neuronx-oci-hook/unknown,now 2.5.3.0 amd64 [installed,upgradable to: 2.6.36.0]
aws-neuronx-runtime-lib/unknown,now 2.22.14.0-6e27b8d5b amd64 [installed,upgradable to: 2.23.110.0-9b5179492]
aws-neuronx-tools/unknown,now 2.19.0.0 amd64 [installed,upgradable to: 2.20.204.0]
Who can help?
@michaelbenayoun
Running this tutorial, Supervised Fine-Tuning of Llama 3 8B on one AWS Trainium instance, gives the following error at the compile step:
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction (minimal, reproducible, runnable)
Execute this script from the tutorial.
Expected behavior
It should compile without error.