v1.0b #25
base: main
Conversation
…h:*-cuda*-cudnn*-runtime & update Dockerfile
@vzip, the latest solution launched with errors, here are full logs:
inca-smc-mlops-challenge-solution-84c9b6cf74-z5jwt.log
@rsolovev, found a problem with a wrong /dir in supervisord.conf - solved.
Amazon approved the g4dn.2xlarge instance for me; I will try to run it and optimise the number of workers in the solution.
Please run the test.
@vzip here are the logs for the latest commit:
inca-smc-mlops-challenge-solution-758765f579-pwhwd.log
The pod is running without restarts, but every curl request (even from the pod's localhost) hangs with no response. There seem to be no problems with GPU/CUDA --
root@inca-smc-mlops-challenge-solution-758765f579-pwhwd:/solution# nvidia-smi
Thu Jun 1 10:14:34 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 44C P0 39W / 70W | 7632MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@inca-smc-mlops-challenge-solution-758765f579-pwhwd:/solution# python
Python 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.current_device()
0
>>> torch.cuda.get_device_name(0)
'Tesla T4'
>>>
although I can't see any logs related to Redis (not even unsuccessful ones), and the Redis-related env vars are set:
root@inca-smc-mlops-challenge-solution-758765f579-pwhwd:/solution# echo "$REDIS_HOST $REDIS_PASSWORD"
inca-redis-master.default.svc.cluster.local <redacted>
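One quick way to rule Redis out from inside the pod would be a direct ping with redis-py; this is only a debugging sketch, assuming password auth and the default Redis port 6379 (not part of the solution itself):

import os
import redis  # redis-py

# Hypothetical connectivity check using the env vars already set in the pod;
# assumes the default Redis port 6379 unless REDIS_PORT overrides it.
client = redis.Redis(
    host=os.environ["REDIS_HOST"],
    password=os.environ.get("REDIS_PASSWORD"),
    port=int(os.environ.get("REDIS_PORT", 6379)),
    socket_connect_timeout=5,
)
print(client.ping())  # True if the pod can reach the Redis master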
Thank you for the run. Checked: the problem was that the server did not validate the data. Solved.
@rsolovev Added validation of input data in incoming requests; this should fix the previous issue. P.S. Launched on g4dn.2xlarge, and it looks like I can try to fit in one more cluster of workers. From some tests, only 8 workers (models) fit into 16 GB of GPU memory, but that should give better results under a large volume of tasks. The queues are laid out like 1,2,3,4,5 and 7,8,9,4,5 (where workers 4 and 5 handle tasks for both chains); I will try to pick the 2 fastest of these 5 models and put them on double duty.
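The exact validation added in this commit isn't shown here; a minimal sketch of the kind of check described (accept only a non-empty JSON-encoded string), using a hypothetical validate_payload helper:

import json

def validate_payload(raw_body: bytes) -> str:
    # Hypothetical check: the request body must be a JSON-encoded, non-empty string.
    try:
        data = json.loads(raw_body)
    except ValueError:
        raise ValueError("body is not valid JSON")
    if not isinstance(data, str) or not data.strip():
        raise ValueError("body must be a non-empty JSON string")
    return data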
@vzip thank you! The API now responds with the intended output, but the key names are a bit off -- it is expected to identify model answers by the model's author rather than the model's name. Please check this section of the readme. Without this format we won't be able to properly execute the tests, so can you please change the output? Thank you in advance.
P.S. The result I got for my curl was:
curl --request POST \
> --url http://localhost:8000/process \
> --header 'Content-Type: application/json' \
> --data '"I live in London"'
{"twitter-xlm-roberta-base-sentiment": {"score": 0.759354829788208, "label": "NEUTRAL"}, "language-detection-fine-tuned-on-xlm-roberta-base": {"score": 0.9999200105667114, "label": "English"}, "twitter-xlm-roberta-crypto-spam": {"score": 0.8439149856567383, "label": "SPAM"}, "xlm_roberta_base_multilingual_toxicity_classifier_plus": {"score": 0.9999451637268066, "label": "LABEL_0"}, "Fake-News-Bert-Detect": {"score": 0.95546954870224, "label": "LABEL_0"}}
@rsolovev The output has been changed; it is now keyed by the model's author.
@rsolovev Thank you so much for running the test. Yes, I will update it now; I have prepared a config to fill all the video memory and will compare results.
@rsolovev Please run another test; we will see what more workers can improve. P.S. Please allow 2 minutes after starting the Docker instance so that all workers are fully loaded into GPU memory before starting the test.
Question: I noticed that some participants use model optimization approaches, but the task notes that "Model's performance optimization is not allowed." From the architecture side, the bottleneck is the models and the memory they occupy. If exceptions are allowed here, please confirm: I could then fit the second group of workers completely into memory, or reduce the time to compute its answers. I see the possibility of improving results on the maximum volume of incoming requests by 2x thanks to optimization. Of course, I still have a backup plan to completely split the queues of the worker groups, but when 2 workers from the first group have to help the second group only because 2 GB of GPU memory was not enough to fit the whole group, it's a shame :)
Hey there, @vzip, we had an extensive internal debate regarding this, and a compromise we agreed on is: But please bear in mind, that there won't be any "bonus points" just for a very performant model-optimized solution, as we have various determining factors when choosing the best solution. |
@rsolovev Thank you for running the test. It is strange that I see different results on my infra, but that is OK. I swapped some workers; please run the test again, and I will keep trying to find a more efficient solution. P.S. I think that Redis not being on the same host may affect the timings; I will cut Redis out and check. But in my solution Redis provides something important: over a long run, if something crashes, the tasks stay safe and are guaranteed to be completed. Next I will test ONNX, because it changes how fast the models solve tasks.
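The crash-safety argument usually relies on moving each task onto a per-worker in-flight list so it can be recovered after a failure; a generic redis-py sketch of that pattern (list names and worker ids are made up, and this is not necessarily how the PR implements it):

import redis

r = redis.Redis(host="inca-redis-master.default.svc.cluster.local")

def pop_task(worker_id: str, timeout: int = 5):
    # Atomically move one task from the shared queue to this worker's
    # in-flight list, so a crash mid-task does not lose it (BLMOVE, Redis >= 6.2).
    return r.blmove("tasks", f"processing:{worker_id}", timeout,
                    src="LEFT", dest="RIGHT")  # bytes, or None on timeout

def ack_task(worker_id: str, raw_task: bytes):
    # Drop the task from the in-flight list once its result has been stored.
    r.lrem(f"processing:{worker_id}", 1, raw_task)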
Hi everybody. This is a first example based on Tornado for holding requests until their responses are ready, Redis with multiple DBs and pipelines, and ioredis and asyncio for the tasker, and that's all. It's pretty simple but works stably and fast. It is easy to extend the clusters of workers to add processing power for ML tasks, because only one part of the whole app builds the queue. I ran tests on a t2.xlarge AWS EC2 instance with all processing on CPU: 5 ML worker instances used a stable 8 GB of RAM, and 10 ML workers as 2 clusters made average responses about 2x faster. I plan to run tests on GPU in the next few days.
P.S. config.py needs to be updated to add settings for running more clusters together - this will be released soon.
Thank you all and have a good time!
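A minimal sketch of the request-holding pattern described above: a Tornado handler pushes the text onto a Redis queue and blocks until a worker publishes the result. The endpoint path, key names, and timeout are assumptions for illustration, not the actual solution code:

import asyncio
import json
import uuid

import redis.asyncio as aioredis
import tornado.web

redis_client = aioredis.Redis(host="localhost")

class ProcessHandler(tornado.web.RequestHandler):
    async def post(self):
        task_id = str(uuid.uuid4())
        text = json.loads(self.request.body)  # body is a JSON string, e.g. "I live in London"
        # Push the task onto the shared queue consumed by the ML workers.
        await redis_client.rpush("tasks", json.dumps({"id": task_id, "text": text}))
        # Hold the request open until a worker pushes the aggregated result.
        result = await redis_client.blpop(f"result:{task_id}", timeout=30)
        if result is None:
            raise tornado.web.HTTPError(504)
        self.set_header("Content-Type", "application/json")
        self.write(result[1])  # (key, value) tuple; value is the JSON response

def make_app():
    return tornado.web.Application([("/process", ProcessHandler)])

async def main():
    make_app().listen(8000)
    await asyncio.Event().wait()

if __name__ == "__main__":
    asyncio.run(main())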