This repository has been archived by the owner on Jun 25, 2023. It is now read-only.

Electriclizard solution3 #23

Open
electriclizard wants to merge 17 commits into base: main

Conversation

electriclizard
Contributor

Hello!
I have a somewhat hacky solution. Every model has its own text tokenizer, so tokenization runs five times, once before each model. I tried using a single RoBERTa tokenizer for all the models: it moves the data to the device (GPU) only once and the models all read the same inputs. It has some issues with the model answers and the approach still needs to be validated on the test dataset, but it runs faster in my local tests and gives us the ability to train all the models we need with one tokenizer and get some performance gain.
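For illustration only, here is a minimal sketch of the idea above, not the code from this PR: the checkpoint names are hypothetical placeholders, and it assumes every downstream model accepts XLM-RoBERTa token ids.

import torch
from typing import List
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

# One shared tokenizer instead of one tokenizer per model.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model_names = ["org/xlm-roberta-task-a", "org/xlm-roberta-task-b"]  # hypothetical checkpoints
models = [
    AutoModelForSequenceClassification.from_pretrained(name).to(device).eval()
    for name in model_names
]

def predict_all(texts: List[str]) -> List[torch.Tensor]:
    # Tokenize once and copy to the GPU once; every model reuses the same tensors.
    inputs = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    ).to(device)
    with torch.inference_mode():
        return [model(**inputs).logits.softmax(dim=-1).cpu() for model in models]

The trade-off is the one noted above: a checkpoint fine-tuned with a different tokenizer will receive mismatched token ids, which would explain the odd model answers.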

Collaborator

@rsolovev left a comment

Hey @electriclizard, in our tests this version sustained up to 30 docs/sec throughput on highload_scenario (+5/+3 increase in rps compared to previous iterations), but in the middle of the next stage CUDA reported an OOM; here is a log sample:

...
INFO:     10.147.2.177:47382 - "POST /process HTTP/1.1" 200 OK
INFO:     10.147.2.177:47180 - "POST /process HTTP/1.1" 200 OK
INFO:     10.147.2.177:47388 - "POST /process HTTP/1.1" 200 OK
INFO:     10.147.2.177:47374 - "POST /process HTTP/1.1" 200 OK
INFO:     10.147.2.177:42462 - "POST /process HTTP/1.1" 200 OK
INFO:     10.147.2.177:47166 - "POST /process HTTP/1.1" 200 OK
INFO:     10.147.2.177:42490 - "POST /process HTTP/1.1" 200 OK
INFO:     10.147.2.177:42456 - "POST /process HTTP/1.1" 200 OK
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<PredictionHandler.handle() done, defined at /src/handlers/recognition.py:37> exception=OutOfMemoryError('CUDA out of memory. Tried to allocate 54.00 MiB (GPU 0; 14.76 GiB total capacity; 13.00 GiB already allocated; 28.75 MiB free; 13.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF')>
Traceback (most recent call last):
  File "/src/handlers/recognition.py", line 51, in handle
    outs = model(inputs)
  File "/src/infrastructure/models.py", line 70, in __call__
    logits = self.model(**inputs).logits
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/transformers/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 1226, in forward
    outputs = self.roberta(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/transformers/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 854, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/transformers/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 528, in forward
    layer_outputs = layer_module(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/transformers/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 412, in forward
    self_attention_outputs = self.attention(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/transformers/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 339, in forward
    self_outputs = self.self(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/transformers/src/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 259, in forward
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 54.00 MiB (GPU 0; 14.76 GiB total capacity; 13.00 GiB already allocated; 28.75 MiB free; 13.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Task exception was never retrieved
future: <Task finished name='Task-4' coro=<PredictionHandler.handle() done, defined at /src/handlers/recognition.py:37> exception=OutOfMemoryError('CUDA out of memory. Tried to allocate 54.00 MiB (GPU 0; 14.76 GiB total capacity; 13.00 GiB already allocated; 28.75 MiB free; 13.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF')>

...

@electriclizard
Contributor Author

Hmm, that's interesting. I'll try to reproduce the error later, thank you for the log!

@electriclizard
Contributor Author

Hey @rsolovev, I've fixed the out-of-memory issue and successfully ran all k6 tests with no failures on my local GPU, so I'm waiting for the new results.
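(The fix itself isn't visible in this thread; purely as a hedged illustration, the sketch below shows one common mitigation for this kind of OOM under load: bounding how many requests may run a forward pass at once and capping the batch size. All names are hypothetical and this is not the code from the PR.)

import asyncio
import torch

class BoundedPredictor:
    # Hypothetical wrapper, not the repository's PredictionHandler.
    # Create it inside the running event loop (e.g. an application startup hook).

    def __init__(self, models, tokenizer, device="cuda", max_concurrency=1, max_batch=32):
        self.models = models
        self.tokenizer = tokenizer
        self.device = device
        self.max_batch = max_batch
        self._gate = asyncio.Semaphore(max_concurrency)  # limits in-flight forward passes

    async def predict(self, texts):
        async with self._gate:
            loop = asyncio.get_running_loop()
            # Run the blocking forward pass off the event loop.
            return await loop.run_in_executor(None, self._forward, texts)

    def _forward(self, texts):
        outputs = []
        for start in range(0, len(texts), self.max_batch):  # cap per-batch activation memory
            chunk = texts[start:start + self.max_batch]
            inputs = self.tokenizer(
                chunk, padding=True, truncation=True, max_length=512, return_tensors="pt"
            ).to(self.device)
            with torch.inference_mode():  # no autograd buffers are kept on the GPU
                outputs.append([m(**inputs).logits.cpu() for m in self.models])
        return outputs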

@darknessest
Collaborator

Hey there, @electriclizard, could you please check your email inbox, specifically for emails from the @inca.digital domain.

@electriclizard
Contributor Author

> Hey there, @electriclizard, could you please check your email inbox, specifically for emails from the @inca.digital domain.

Done!

Collaborator

@rsolovev left a comment

@electriclizard thank you, here are the results for the latest commit from this branch

@electriclizard
Contributor Author

electriclizard commented May 31, 2023

> @electriclizard thank you, here are the results for the latest commit from this branch

Strange results 🤔. In my local tests this was the most efficient solution because of the single tokenization pass for all models.
[attached screenshot: telegram-cloud-photo-size-2-5454396707308685690-y]
Could the rps depend on network speed and stability?
I understand that the rps is lower because I ran the tests on a local network, but this solution was still more efficient than the previous ones.

@rsolovev
Collaborator

> Could the rps depend on network speed and stability?

Sure it could, but we try to run tests and solutions isolated from the rest of the infrastructure to minimise these outside variables. Let me check whether that was the right commit's test run.

@rsolovev
Collaborator

@electriclizard -- here are the results for the restart -- grafana
