Potential memory leak #156
Comments
@novak2000 I'm having this issue too. I tried on my laptop on CPU, and when I use Embed4all or SentenceTransformer, both show a huge increase in memory after each request. Please let me know if you found any solution.
It seems it's linked to an issue with Hyper: hyperium/hyper#1790
I seem to still be running into this issue of steadily growing memory usage in the text-embeddings-router process. This is with TEI 1.1.0, but I also tested 1.0.0 with the same results. Running with Docker:
docker run --name tei --gpus all -e CUDA_MEMORY_FRACTION=1.0 -p 8081:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.1.0 --model-id $model --tokenization-workers 4 --max-batch-tokens 131072 --max-batch-requests 1024 --pooling cls
The model is BAAI/bge-small-en-v1.5.
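In case it helps with reproducing, the traffic is essentially just a steady stream of embed requests. A trimmed-down sketch of that kind of load loop (placeholder host, port, and payload, following TEI's documented /embed API; not my exact client code) would be:

```bash
# Illustrative load loop against the /embed endpoint (adjust host/port to your deployment).
# Run it while watching the RSS of the text-embeddings-router process.
for i in $(seq 1 100000); do
  curl -s -X POST http://localhost:8081/embed \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?"}' > /dev/null
done
```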
Do you have a graph of the memory increase? And if you have v1.0.0 vs v1.1.0, that would be amazing.
I don't have a pretty graph, but here are three ps outputs from the past 24 hours: the first is from just after starting the Docker image, the second from not long after, and the third from a minute ago, where you can see the memory percentage has grown to 8.2% of the server's RAM from the first output's 3.6%.
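For a clearer trend than spot ps checks, one option is to sample the container's memory usage on a schedule, for example with docker stats. A rough sketch (arbitrary 5-minute interval; "tei" is the container name from the run command above):

```bash
# Sample the container's memory usage every 5 minutes and append it to a CSV.
while true; do
  echo "$(date -Is),$(docker stats tei --no-stream --format '{{.MemUsage}}')" >> tei_mem.csv
  sleep 300
done
```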
I'm embedding 1 million vectors with the default config without any issue; maybe the worker causes the leak?
BTW, I forgot to mention regarding 1.0.0 vs 1.1.0: I tried both, and they behaved similarly with respect to the growing memory use. The worker/client program in my case is on a separate server, and the embedding throughput is in the range of 5 to 10 million requests to the TEI server per 24 hours.
Ok, I will keep this on my priority list, but it must be very deep in the stack and might take some time to find.
That's great :) It's always nice to hear that the project is running in prod with some real throughput requirements.
Ok, if you need any other details let me know. The instance is on AWS, a g5.xlarge (1 NVIDIA A10G GPU), using the AMI: Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04) 20231103. The GPU is being shared: 90% of it runs a separate vLLM text generation server, and the other 10% is used by TEI.
@novak2000 can you use 1.1 and keep the memory resource limit? I'm wondering whether or not the container will be killed on 1.1.
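For example, something along these lines (the 8g value is purely illustrative) would put a hard cap on the container, so a real leak shows up as an OOM kill instead of consuming the host's RAM:

```bash
# Illustrative: cap the container at 8 GiB so a genuine leak surfaces as an OOM kill.
docker run --name tei --gpus all --memory=8g -p 8081:80 -v $volume:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.1.0 \
  --model-id BAAI/bge-small-en-v1.5 --pooling cls
```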
Ok, thanks for this info.
Any news on this? I'm running into the same issue, and it's not usable in production.
Yes, it seems that there was a leak in one of our dependencies. This is orthogonal to the allocator problem reported above; see https://www.algolia.com/blog/engineering/when-allocators-are-hoarding-your-precious-memory/ for more info on the subject. I will release 1.3 with this PR today. Will you be able to test it and report whether the problem is indeed fixed?
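If anyone wants to check the allocator-hoarding angle on their own deployment, one possible experiment (only meaningful if the server in the image uses glibc malloc, which I have not verified here) is to cap the number of malloc arenas and see whether the apparent growth flattens:

```bash
# Hypothetical experiment: limit glibc malloc arenas to reduce per-thread memory hoarding.
# A true leak will keep growing regardless of this setting.
docker run --name tei --gpus all -e MALLOC_ARENA_MAX=2 -p 8081:80 -v $volume:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.1.0 \
  --model-id BAAI/bge-small-en-v1.5 --pooling cls
```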
I can try it today if you want, but I don't see the 1.3 release at the moment.
It's released now.
@djanito, were you able to try it out?
System Info
I'm running a Docker container with the BAAI rerank-base model on a local PC with an RTX 4090, an Intel i9-13900KF, and 64 GB of RAM.


Reproduction
After calling the '/rerank' request many times (around 400,000 times with 5,000 texts each), RAM usage increases significantly (from 6 GB to 42+ GB).
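A trimmed-down sketch of that request pattern (placeholder host, port, and payload, following TEI's documented /rerank API; the real requests carried around 5,000 texts each) looks like:

```bash
# Sketch of the repro: POST to /rerank in a loop and watch the process RSS climb.
# The actual workload used ~5,000 texts per request; this payload is trimmed for brevity.
for i in $(seq 1 400000); do
  curl -s -X POST http://localhost:8080/rerank \
    -H 'Content-Type: application/json' \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep learning is a subset of machine learning.", "The sky is blue."]}' > /dev/null
done
```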


[Screenshots: memory usage before (~6 GB) and after (42+ GB)]
Expected behavior
Is this behavior expected? Since I'm unfamiliar with Rust and its basic concepts, any feedback would be helpful.
Thanks!