
Potential memory leak #156

Open · 2 of 4 tasks

novak2000 opened this issue Feb 11, 2024 · 19 comments · Fixed by #161

Comments

@novak2000

System Info

I'm running a Docker container serving the BAAI rerank-base model on a local PC with an RTX 4090, an Intel i9-13900KF, and 64 GB of RAM.
[screenshot: system information]

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

After calling the '/rerank' endpoint many times (around 400,000 requests with 5,000 texts each), RAM usage increases significantly (from 6 GB to 42+ GB). A rough client loop that generates this kind of load is sketched after the screenshots below.
Memory usage before: [screenshot]
Memory usage after: [screenshot]
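Not the original reproduction script, but a minimal Rust sketch of this kind of load, assuming TEI listens on localhost:8080 and that /rerank accepts a JSON body with a "query" string and a "texts" array; the URL, port, query text, and crate versions are assumptions:

// Assumed Cargo dependencies: reqwest = { version = "0.11", features = ["blocking", "json"] }, serde_json = "1"
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let client = reqwest::blocking::Client::new();
    // 5,000 dummy texts per request, mirroring the report above.
    let texts: Vec<String> = (0..5_000).map(|i| format!("document number {i}")).collect();
    let body = serde_json::json!({ "query": "what is a memory leak?", "texts": texts });

    for i in 0..400_000u32 {
        client
            .post("http://localhost:8080/rerank")
            .json(&body)
            .send()?
            .error_for_status()?; // fail fast on HTTP errors
        if i % 1_000 == 0 {
            // Watch the router's RSS (e.g. `ps -o rss= -p <pid>`) while this runs.
            println!("sent {i} requests");
        }
    }
    Ok(())
}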

Expected behavior

Is this behavior expected? Since I'm unfamiliar with Rust and its basic concepts, any feedback would be helpful.
Thanks!

@karan00713

karan00713 commented Feb 15, 2024

@novak2000 I'm having this issue too. I tried on my laptop on CPU with Embed4all and SentenceTransformer; both showed a huge increase in memory after each request. Kindly let me know if you found any solutions.

@OlivierDehaene
Member

It seems it's linked to an issue with Hyper: hyperium/hyper#1790
#161 solves the issue by using another memory allocator.
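For readers unfamiliar with the Rust side: swapping the global allocator of a binary is a one-line change. The allocator actually adopted in #161 isn't named in this thread, so the sketch below uses the mimalloc crate purely as an illustration (crate choice and version are assumptions):

// Assumed Cargo dependency: mimalloc = "0.1"
use mimalloc::MiMalloc;

// Every heap allocation in the binary, including ones made by dependencies
// such as hyper, now goes through mimalloc instead of the system allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    let v = vec![0u8; 1024]; // allocated via mimalloc
    println!("allocated {} bytes", v.len());
}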

@A-Posthuman

I seem to still be running into this issue of steadily growing memory usage in the text-embeddings-router process. This is with TEI 1.1.0, but I also tested 1.0.0 with the same results.

Running with docker:

docker run --name tei --gpus all -e CUDA_MEMORY_FRACTION=1.0 -p 8081:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.1.0 --model-id $model --tokenization-workers 4 --max-batch-tokens 131072 --max-batch-requests 1024 --pooling cls

model is: BAAI/bge-small-en-v1.5

OlivierDehaene reopened this Mar 7, 2024
@OlivierDehaene
Member

Do you have a graph of the memory increase? And if you have v1.0.0 vs v1.1.0 to compare, that would be amazing.

@A-Posthuman

I don't have a pretty graph, but here are 3 ps outputs over the past 24 hrs. The first one is from just after starting the Docker image, the 2nd is from not long after, and the 3rd is from a minute ago, where you can see the memory percentage has grown to 8.2% of the server's RAM from the first output's 3.6%.

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     3341091 12.0  3.6 54668960 583220 ?     Ssl  18:48   0:01 text-embeddings-router --model-id BAAI/bge-small-en-v1.5 --tokenization-workers 4 --max-batch-tokens 131072 --max-batch-requests 1024 --pooling cls

root     3341091  2.5  3.9 54762532 638616 ?     Ssl  18:48   0:51 text-embeddings-router --model-id BAAI/bge-small-en-v1.5 --tokenization-workers 4 --max-batch-tokens 131072 --max-batch-requests 1024 --pooling cls

root     3341091 65.2  8.2 55811112 1338148 ?    Ssl  Mar06 594:48 text-embeddings-router --model-id BAAI/bge-small-en-v1.5 --tokenization-workers 4 --max-batch-tokens 131072 --max-batch-requests 1024 --pooling cls

@hiepxanh

hiepxanh commented Mar 7, 2024

I'm embedding 1 million vectors with the default config without any issue; maybe the workers cause the leak?

@A-Posthuman

BTW, I forgot to mention: regarding 1.0.0 vs 1.1.0, I tried both, and they behaved similarly with respect to the growing memory use.

The worker/client program in my case runs on a separate server, and it sends in the range of 5 to 10 million embedding requests to the TEI server per 24 hrs.

@OlivierDehaene
Member

OK, I will keep this on my priority list, but it must be very deep in the stack and might take some time to find.

The worker/client program in my case runs on a separate server, and it sends in the range of 5 to 10 million embedding requests to the TEI server per 24 hrs.

That's great :) It's always nice to hear that the project is running in prod with some real throughput requirements.

@A-Posthuman

OK, if you need any other details, let me know. The instance is on AWS, a g5.xlarge (1 NVIDIA A10G GPU), using the AMI:

Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04) 20231103
id: ami-0ac1f653c5b6af751

The GPU is being shared: 90% of it goes to a separate vLLM text generation server, and the other 10% is used by TEI.

@novak2000
Author

Just to mention that I'm also running into the same issue again.
I'm using version 1.0.
[memory usage screenshot]

@OlivierDehaene
Member

@novak2000 can you use 1.1 and keep the memory resource limit? I'm wondering whether the container will still be killed on 1.1.

@novak2000
Author

I'm sending you docker stats before and after running a simple test with around 25k requests to the server (each request has between 100 and 1,000 texts to embed and ~1,000 texts to rerank).

models used:
reranker: BAAI/bge-reranker-base
embedding: sentence-transformers/multi-qa-MiniLM-L6-cos-v1

before: [docker stats screenshot]

after ~10k requests (they appeared to be running stably just beneath the memory limit): [docker stats screenshot]

after ~20k requests, the embedding server got killed and restarted on failure: [docker stats screenshots]

Let me know if you need more details

@novak2000
Author

I ran the tests again, and this time both services were killed.

Graph of memory consumption: [screenshot]

@OlivierDehaene
Member

OK, thanks for this info.
I'm more or less off this week, so I will keep digging when I find the time.

@djanito

djanito commented Jun 21, 2024

Any news on this? I'm running into the same issue, and it's not usable in production.

@OlivierDehaene
Member

Yes, it seems there was a leak in one of our dependencies. This is orthogonal to the allocator problem reported above.
We updated the dependency and added logic to trim OS pages in #307.

See: https://www.algolia.com/blog/engineering/when-allocators-are-hoarding-your-precious-memory/ for more info on the subject.
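The change in #307 isn't reproduced here, but the idea of trimming OS pages can be sketched: on Linux with glibc, memory that the allocator is hoarding after frees can be handed back explicitly with malloc_trim. A rough illustration via the libc crate; the call site and the idea of invoking it periodically are assumptions, not the actual TEI code:

// Assumed Cargo dependency: libc = "0.2". malloc_trim is glibc-specific.
#[cfg(target_os = "linux")]
fn trim_heap() {
    unsafe {
        // Ask glibc to release free heap pages back to the OS.
        // Returns 1 if memory was actually released, 0 otherwise.
        let released = libc::malloc_trim(0);
        eprintln!("malloc_trim released memory: {}", released == 1);
    }
}

#[cfg(not(target_os = "linux"))]
fn trim_heap() {
    // No-op on platforms without glibc's malloc_trim.
}

fn main() {
    // For example, call this from a background task every few minutes,
    // or after a large batch of requests has been processed.
    trim_heap();
}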

I will release 1.3 with this PR today. Will you be able to test it and report if the problem is indeed fixed?

@djanito

djanito commented Jun 28, 2024

I can try it today if you want, but I don't see the 1.3 release at the moment.

@OlivierDehaene
Member

It's released now.

@OlivierDehaene
Member

@djanito, were you able to try it out?
