nginx-uvicorn-asyncio #34
base: main
Conversation
@vzip thank you for this great iteration. The application starts with no issues, but the response format is a bit off and we can't autotest it:

```json
{
  "worker4": {"EIStakovskii": {"score": 0.9995502829551697, "label": "LABEL_0"}},
  "worker3": {"svalabs": {"score": 0.9966347813606262, "label": "SPAM"}},
  "worker5": {"jy46604790": {"score": 0.9940044283866882, "label": "LABEL_0"}},
  "worker1": {"cardiffnlp": {"score": 0.4247715175151825, "label": "POSITIVE"}},
  "worker2": {"ivanlau": {"score": 0.1369515061378479, "label": "Maltese"}}
}
```

(the "worker" keys are redundant)
Oops. @rsolovev, changed; please run the test again.
@rsolovev thank you. That is not perfect) I don't understand what goes differently on your test env versus mine, but all the scores differ by roughly a factor of two. I will explore my solution further, add a lot of model optimization, and try your NVIDIA driver version 11.4 (mine is 12.0).
@rsolovev Hi. Please run the test. P.S. I have been slightly out of the loop due to unforeseen circumstances (my girlfriend went missing while I was in Mexico; it was quite an emergency, but everything turned out fine and she was found safe and sound!). Over the weekend I resumed my research and delved into T4 GPU memory and maximizing the task load in a single batch. I discovered that it's not always ideal to load too many tasks onto the GPU, because of the driver's configuration: if the GPU is fully loaded, the clock frequency and computing throughput naturally decrease. For now I have found the optimal configuration to be 5 tasks in 1 batch, resulting in an average frequency of 900 MHz. I'm still experimenting with running two applications in parallel, but there are some issues since I'm using Python and threading in this context; my plan is to either try multiprocessing or rewrite everything in JavaScript. I'm still striving for the best possible outcome without using ONNX.
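As a rough illustration of the batching idea described above, a sketch of a worker that drains up to 5 queued tasks and runs them through the model in one forward pass (all names here are assumptions, not the PR's code):

```python
import asyncio

import torch

MAX_BATCH = 5  # the empirical sweet spot on the T4 mentioned above

async def batch_worker(task_queue: asyncio.Queue, model, tokenizer):
    while True:
        batch = [await task_queue.get()]        # wait for the first task
        while len(batch) < MAX_BATCH:           # opportunistically add more
            try:
                batch.append(task_queue.get_nowait())
            except asyncio.QueueEmpty:
                break
        texts = [text for text, _ in batch]
        inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():                   # inference only, no gradients
            logits = model(**inputs).logits     # blocking call; a real app might
                                                # offload it to an executor
        for (_, future), row in zip(batch, logits):
            future.set_result(row)              # unblock the waiting request
```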
Hello everyone. This is the second concept, based on nginx, which distributes requests round-robin between two applications running in parallel under uvicorn, each on its own port. Each application includes 5 models, each receiving tasks from its own queue; the results are added to a result_queue, and an asynchronous function pulls out ready tasks and sets their futures. On my local stand, more than 9000 requests pass. I had tried many different combos with gunicorn and two workers, but that can only run as 2 workers on 2 CPUs; another variant was two async flows inside the app with 2 queues, but because Python is not truly async, it was slow.
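A minimal sketch of the queue-plus-futures flow described above, assuming asyncio queues and in-process model workers (names like `handle_request` and `dispatcher` are mine, not the PR's):

```python
import asyncio

# Results arrive here as (future, result) pairs, pushed by the model workers.
result_queue: asyncio.Queue = asyncio.Queue()

async def handle_request(task_queue: asyncio.Queue, text: str):
    # Pair the task with a future; a model worker will compute the result
    # and push (future, result) onto result_queue.
    future = asyncio.get_running_loop().create_future()
    await task_queue.put((text, future))
    return await future          # resolved by the dispatcher below

async def dispatcher():
    # The single async function that pulls ready tasks and sets futures.
    while True:
        future, result = await result_queue.get()
        if not future.done():
            future.set_result(result)
```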
P.S. This is all without any model optimizations, since converting the models to ONNX is more of a model test than a runtime concern; I plan to add that to the current concept and also to the first v1.0b solution with tornado and Redis. Only there I need to deploy Redis inside the Docker container with the application, because on the test bench here it is probably somewhere far away, and the network adds an extra 100+ ms, I think.
P.P.S. I also saw some participants trying to use one tokenizer for all models. That is wrong: the tokenizer is different for each model and can tokenize spaces and special characters in different ways, which can give inaccurate results.
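To illustrate that point, a sketch with placeholder model names (not the five models in this PR): every checkpoint is loaded together with its own tokenizer instead of sharing one.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAMES = ["org-a/model-a", "org-b/model-b"]  # placeholders only

models = {}
for name in MODEL_NAMES:
    # Each checkpoint ships its own vocab and pre-tokenization rules,
    # so the tokenizer must be loaded per model, never shared.
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)
    models[name] = (tokenizer, model)
```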