Electriclizard solution3 #23
base: main
Conversation
Single tokenize
Hey @electriclizard, on our tests this version held up to 30 docs/sec throughput on highload_scenario
(a +5/+3 increase in rps compared to previous iterations), but in the middle of the next stage CUDA reported an OOM. Here is a log sample:
...
INFO: 10.147.2.177:47382 - "POST /process HTTP/1.1" 200 OK
INFO: 10.147.2.177:47180 - "POST /process HTTP/1.1" 200 OK
INFO: 10.147.2.177:47388 - "POST /process HTTP/1.1" 200 OK
INFO: 10.147.2.177:47374 - "POST /process HTTP/1.1" 200 OK
INFO: 10.147.2.177:42462 - "POST /process HTTP/1.1" 200 OK
INFO: 10.147.2.177:47166 - "POST /process HTTP/1.1" 200 OK
INFO: 10.147.2.177:42490 - "POST /process HTTP/1.1" 200 OK
INFO: 10.147.2.177:42456 - "POST /process HTTP/1.1" 200 OK
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<PredictionHandler.handle() done, defined at /src/handlers/recognition.py:37> exception=OutOfMemoryError('CUDA out of memory. Tried to allocate 54.00 MiB (GPU 0; 14.76 GiB total capacity; 13.00 GiB already allocated; 28.75 MiB free; 13.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF')>
Traceback (most recent call last):
File "/src/handlers/recognition.py", line 51, in handle
outs = model(inputs)
File "/src/infrastructure/models.py", line 70, in __call__
logits = self.model(**inputs).logits
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/transformers/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 1226, in forward
outputs = self.roberta(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/transformers/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 854, in forward
encoder_outputs = self.encoder(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/transformers/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 528, in forward
layer_outputs = layer_module(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/transformers/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 412, in forward
self_attention_outputs = self.attention(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/transformers/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 339, in forward
self_outputs = self.self(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/transformers/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 259, in forward
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 54.00 MiB (GPU 0; 14.76 GiB total capacity; 13.00 GiB already allocated; 28.75 MiB free; 13.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Task exception was never retrieved
future: <Task finished name='Task-4' coro=<PredictionHandler.handle() done, defined at /src/handlers/recognition.py:37> exception=OutOfMemoryError('CUDA out of memory. Tried to allocate 54.00 MiB (GPU 0; 14.76 GiB total capacity; 13.00 GiB already allocated; 28.75 MiB free; 13.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF')>
...
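Side note: the allocator hint from the traceback (max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF) can be tried as a fragmentation workaround, though it won't help if the GPU is genuinely out of memory. A minimal sketch, with an arbitrary example value:

```python
# Fragmentation workaround suggested by the traceback above; the 128 MiB value
# is only an example, not a recommendation. This has to be set before CUDA is
# initialized in the process (e.g. in the container env or at the very top of
# the entrypoint).
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```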
Hmm, that's interesting. I'll try to reproduce the error later, thank you for the log!
Hey @rsolovev, I've fixed the out-of-memory issue and successfully ran all the k6 tests with no failures on my local GPU, so I'm waiting for the new results.
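For reference, the general pattern that avoids this kind of OOM is to cap how many texts reach the GPU per forward pass and to keep inference out of autograd. A hypothetical sketch of that pattern (placeholder names and values, not the actual diff from this branch):

```python
# Hypothetical sketch: bound the per-forward batch size, serialize GPU access,
# and run under inference_mode so no autograd buffers are retained.
import asyncio

import torch

MAX_BATCH = 16                   # placeholder cap, tuned to the GPU memory budget
gpu_lock = asyncio.Semaphore(1)  # one batch on the GPU at a time

async def run_model(model, tokenizer, texts, device="cuda"):
    outputs = []
    for start in range(0, len(texts), MAX_BATCH):
        chunk = texts[start:start + MAX_BATCH]
        inputs = tokenizer(chunk, padding=True, truncation=True,
                           max_length=512, return_tensors="pt").to(device)
        async with gpu_lock:
            with torch.inference_mode():   # no gradients kept around
                logits = model(**inputs).logits
        outputs.append(logits.cpu())       # release GPU memory before the next chunk
    return torch.cat(outputs)
```

In a real handler the forward pass would usually be pushed to a worker thread or queue so it doesn't block the event loop, but the memory-bounding idea is the same.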
Hey there, @electriclizard, could you please check your email inbox, specifically emails from the @inca.digital domain?
Done!
@electriclizard thank you, here are the results for the latest commit from this branch
Strange results 🤔. In local tests this was the most efficient solution, because of the single tokenization shared across all models.
Sure, it could, but we're trying to launch tests and solutions isolated from the rest of the infrastructure to minimise these outside variables. Let me check whether that was the right commit's test run.
@electriclizard -- here are the results for the restart -- grafana |
Hello!
I have a somewhat hacky solution. Previously every model had its own text tokenizer, so tokenization ran five times, once before each model. I've tried using a single RoBERTa tokenizer for all models instead: it moves the data to the device (GPU) only once, and all the models read the same tensors. It has some issues with model answers, and the approach needs to be validated on the test dataset, but it runs faster in my local tests and gives us the ability to train all the models we need with one tokenizer and get some performance growth.
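To make that concrete, the idea has roughly this shape (a simplified sketch with placeholder checkpoint names, not the exact code from the branch): the batch is tokenized once with the shared XLM-R tokenizer, copied to the GPU once, and the same input tensors are fed to every model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

# One shared tokenizer: this only works because every model is an XLM-R
# fine-tune, so they all expect the same vocabulary and input ids.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Placeholder checkpoint names, just to show the loop over several models.
checkpoints = ["model-a", "model-b", "model-c", "model-d", "model-e"]
models = [AutoModelForSequenceClassification.from_pretrained(name).to(device).eval()
          for name in checkpoints]

def process(texts):
    # Tokenize once and copy to the GPU once; every model reads the same tensors.
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=512, return_tensors="pt").to(device)
    with torch.inference_mode():
        return [model(**inputs).logits.softmax(dim=-1).cpu() for model in models]
```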