
Can't use deployed Neuron model with error NRT:nrt_allocate_neuron_cores #9

Open · marina-pchelina opened this issue Jun 13, 2024 · 5 comments

@marina-pchelina

Hi, I'm following the sample here to try to compile a model to Neuron and deploy on SageMaker.

Following the steps in the sample exactly, I am able to deploy the model, but when I try to invoke it I get a 500 error, and my CloudWatch traceback shows the following:

[Screenshot: CloudWatch traceback ending in an NRT:nrt_allocate_neuron_cores error]

I have only seen this error before when trying to use a second model on the same instance while another one was running, but that should not be the case here.

@jluntamazon

Hi @marina-pchelina

Since NeuronCores are reserved per process, it's possible that you have an old process which is holding onto the NeuronCores but has not been properly terminated. One thing to try is to forcefully stop all running processes: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/training-troubleshooting.html#neuroncore-s-not-available-requested-1-available-0
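For anyone landing here, a minimal sketch (not taken from the linked doc; the process-name pattern is illustrative) of what forcefully stopping stale processes could look like on the instance:

import os
import signal
import subprocess

# Show the Neuron devices; recent versions of neuron-ls also list the PID of
# any process currently holding each device.
print(subprocess.run("neuron-ls", shell=True, stdout=subprocess.PIPE).stdout.decode("utf-8"))

# Kill any leftover Python worker processes that might still own the NeuronCores.
# The "python" pattern is a guess; adjust it to match your worker processes.
pids = subprocess.run("pgrep -f python", shell=True, stdout=subprocess.PIPE).stdout.decode().split()
for pid in pids:
    if int(pid) != os.getpid():  # don't kill the process running this snippet
        os.kill(int(pid), signal.SIGKILL)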

@marina-pchelina
Author

hi, thanks for getting back to me!
What I don't understand is how an old process could be holding on to the NeuronCores if a new inference instance is initialized each time I deploy. Anyway, I tried including some commands from the troubleshooting doc, plus a few others I found, at the top of my inference.py script, like so:

import subprocess

def run(cmd):
    # Run a shell command and print its output so it shows up in CloudWatch.
    print(f"Running {cmd}")
    print(subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT).stdout.decode("utf-8"))

run("apt-get install -y kmod")  # -y: the container has no interactive prompt
run("lsmod | grep neuron")
run("ps aux | grep python")
run("neuron-ls")
run("modinfo neuron")

I can see the 2 cores are there with neuron-ls:
[Screenshot: neuron-ls output showing the 2 NeuronCores]

There are no significant python processes that could be using the cores, and killing them all explicitly didn't help either.

[Screenshot: ps aux | grep python output with no processes that could be holding the cores]

However, it seems like I'm not able to use lsmod or modinfo, both of which work and produce output when run directly inside an EC2 instance (same inf2 type). I tried installing them with apt-get install kmod, but that didn't help either.

[Screenshot: lsmod and modinfo failing inside the container]

Could that possibly have something to do with the image that's used in the tutorial? It's currently this one:
ecr_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-neuronx:1.13.1-neuronx-py310-sdk2.13.2-ubuntu20.04"

@jluntamazon

@marina-pchelina, we were able to reproduce the problem in the tutorial and are looking into a fix.

The root cause appears to be 2 separate misconfigurations:

  1. The default number of workers is set to 4. In torchserve, each worker is a separate process which must take control of a single NeuronCore. Since the ml.inf2.xlarge instance has only 2 NeuronCores, this is an invalid number of workers. You can observe the configuration at the beginning of the logs: Default workers per model: 4
  2. The default number of NeuronCores per worker appears not to be configured, which causes the first process to attempt to take ownership of all NeuronCores. The Neuron runtime allows each process to take control of as many NeuronCores as it needs, and the default behavior is that 1 process takes ownership of all NeuronCores on the instance. When using process-level workers, each process should therefore be configured with the environment variable NEURON_RT_NUM_CORES=1 so that it only takes ownership of a single NeuronCore for the model that it loads (one way to pass these settings is sketched after this list). You can see this in the logs: warnings are issued for each of the 4 model loads, but only 3 nrt_allocate_neuron_cores errors follow, showing that the NeuronCores had already been allocated to another process.
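A sketch of one way these two settings could be passed when deploying with the SageMaker Python SDK (not an official fix; role, ecr_image, and the model artifact path are assumed to come from the tutorial notebook):

from sagemaker.pytorch import PyTorchModel

# Pass the worker count and per-worker NeuronCore count as environment variables.
pytorch_model = PyTorchModel(
    model_data="s3://<bucket>/model.tar.gz",  # placeholder for the compiled model artifact
    role=role,                                # assumes the notebook's SageMaker execution role
    entry_point="inference.py",
    image_uri=ecr_image,                      # the pytorch-inference-neuronx image from the tutorial
    env={
        "SAGEMAKER_MODEL_SERVER_WORKERS": "2",  # ml.inf2.xlarge has 2 NeuronCores
        "NEURON_RT_NUM_CORES": "1",             # each worker process claims a single NeuronCore
    },
)
predictor = pytorch_model.deploy(instance_type="ml.inf2.xlarge", initial_instance_count=1)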

@marina-pchelina
Author

thanks for looking into this!
I tried to re-compile the model with --target inf2 on the off chance it might help configure the number of workers, but it still showed Default workers per model: 4 in the logs.
If it's any help, I can deploy and use models through the HuggingFace integration; the problem with that is that I want to use both cores with DataParallel, which the HuggingFace class doesn't seem to allow.
Let me know if there's anything I can do myself to work around that, otherwise, I'll wait for a fix.
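For reference, a rough sketch of the DataParallel usage in question (assuming a traced TorchScript model saved as model.pt; the file name and input shape are illustrative):

import torch
import torch_neuronx

# Load the compiled (traced) model and let DataParallel shard incoming batches
# across the NeuronCores visible to this process (2 on an inf2.xlarge).
model = torch.jit.load("model.pt")
model_parallel = torch_neuronx.DataParallel(model)

# The batch dimension of the input is split across the NeuronCores.
example = torch.rand(8, 3, 224, 224)
output = model_parallel(example)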

@marina-pchelina
Author

@jluntamazon
Looks like the errors now start even earlier: I can't install neuronx-cc and torch-neuronx when running the same notebook in the same environment (ml.c5.4xlarge with the conda_pytorch_p310 kernel).

Both commands, the original
%pip install --upgrade neuronx-cc==2.* torch-neuronx
and one not specifying package versions
%pip install --upgrade neuronx-cc torch-neuronx --no-cache-dir
result in the same error.
Any pointers for successful installation would be much appreciated!

[Screenshot: pip install error output]
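For reference, the Neuron setup documentation installs these packages from the Neuron pip repository via an extra index URL; whether that is what's missing here is only a guess, but that variant would look like:

%pip install --upgrade --extra-index-url=https://pip.repos.neuron.amazonaws.com neuronx-cc==2.* torch-neuronx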
