
batched inference #108

Closed
wants to merge 0 commits into from

Conversation

@varshith15 (Contributor) commented Jul 31, 2024

  • batched inference unit test case for llama
  • batched inference unit test case for llava
  • take lists as input for prompt, image_str, and request_id
  • fix callback order
  • online batching in the ChatGPT API
  • patch for top_p != 1 in batched sampling
  • early stopping for a sequence in the batch if EOS has already occurred (currently only the last token is checked)
  • online batching conditions updated (timeout, max_batch, etc.); a rough sketch of this batching window follows below
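
For context, a rough sketch of how such an online-batching window could look. This is only illustrative: the asyncio.Queue layout, the constants, and the process_prompt_batch name are assumptions, not the actual exo API.

```python
import asyncio

MAX_BATCH = 4          # illustrative values, not the PR's actual defaults
BATCH_TIMEOUT = 0.05   # seconds to wait for more requests before flushing

async def batching_loop(queue: asyncio.Queue, process_prompt_batch):
  """Collect queued requests until max_batch or timeout, then run one batched call."""
  while True:
    prompts, image_strs, request_ids = [], [], []
    # Block until the first request arrives, then opportunistically collect more.
    prompt, image_str, request_id = await queue.get()
    prompts.append(prompt); image_strs.append(image_str); request_ids.append(request_id)
    deadline = asyncio.get_running_loop().time() + BATCH_TIMEOUT
    while len(prompts) < MAX_BATCH:
      remaining = deadline - asyncio.get_running_loop().time()
      if remaining <= 0:
        break
      try:
        prompt, image_str, request_id = await asyncio.wait_for(queue.get(), remaining)
      except asyncio.TimeoutError:
        break
      prompts.append(prompt); image_strs.append(image_str); request_ids.append(request_id)
    # One batched inference call; results come back in the same order as request_ids.
    await process_prompt_batch(prompts, image_strs, request_ids)
```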

@varshith15 (Contributor, Author) commented Jul 31, 2024

Hey @AlexCheema,
To start with, I'm assuming that in-flight batching isn't needed right away (we'd have to revamp a lot of the API to make in-flight batching happen). We can just batch requests in the API section, run inference, and split the results back apart in the API section. Thoughts?

@AlexCheema (Contributor) commented

> Hey @AlexCheema, to start with I'm assuming that in-flight batching isn't needed right away (we'd have to revamp a lot of the API to make in-flight batching happen). We can just batch requests in the API section, run inference, and split the results back apart in the API section. Thoughts?

Sounds good. Yes, no need for in-flight batching; we can add that in a subsequent PR.
Let's do this!

Related issue: #1

@varshith15 (Contributor, Author) commented Aug 12, 2024

Hey @AlexCheema,
I was working on batching requests at the endpoint, but it seems a little gimmicky to online-batch them ourselves, and there are also issues concerning the stream callback.

My current idea for online batching is to basically combine the requests ("req1_req2_req3..") and then check whether a req_id is present in the broadcast result id and split the result based on the position of the id, but that seems a bit gimmicky.

I think it's better to process requests concurrently using https://github.com/omnilib/aiomultiprocess (#4), sketched below, rather than online batching them.

If the user gives a batch as input we could process it, but I think it's best that we don't online batch it.

Thoughts?
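
For reference, a minimal sketch of the aiomultiprocess alternative mentioned above; handle_request is a hypothetical stand-in for single-request inference, not an existing exo function.

```python
import asyncio
from aiomultiprocess import Pool

async def handle_request(prompt: str) -> str:
  # Placeholder for real single-request inference.
  return f"response to {prompt!r}"

async def main(prompts: list[str]) -> list[str]:
  async with Pool() as pool:
    # Each prompt is handled concurrently in a separate worker process.
    return await pool.map(handle_request, prompts)

if __name__ == "__main__":
  print(asyncio.run(main(["hello", "how are you?"])))
```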

@varshith15 (Contributor, Author) commented

I've thought of a better way: just update the functions to take lists of req_ids, prompts, and img_strs, and broadcast back the specific results based on the index of the request in the batch. That way we don't have to change a lot of code in the ChatGPT API.
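
Roughly, the split-by-index idea looks like this (assumed shapes and names, for illustration only):

```python
# The batched result is broadcast along with the full list of request_ids; each
# waiting API handler picks out its own output by the position of its request_id.
def result_for_request(request_ids: list[str], batched_outputs: list[list[int]], my_request_id: str) -> list[int]:
  idx = request_ids.index(my_request_id)  # position of this request in the batch
  return batched_outputs[idx]             # that request's row of the batched output
```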

@varshith15 (Contributor, Author) commented

Hey @AlexCheema, batched inference with online batching works now. There are a few more small patches required; I'll push them in a bit. Could you please review the idea in the meantime?

@varshith15 (Contributor, Author) commented

@AlexCheema it's done! PRM

@varshith15 marked this pull request as ready for review on August 18, 2024, 11:11
@AlexCheema (Contributor) commented

Thanks @varshith15, I will take a proper look at some point next week.

@AlexCheema (Contributor) commented

I haven't had a proper look yet, but I'm wondering what the behaviour is when a request is being processed and another request comes in?

@varshith15 (Contributor, Author) commented

Currently the requests just get processed one after the other (by processed I mean after the await on process_prompt completes), but maybe a semaphore is needed.

@AlexCheema (Contributor) commented

> Currently the requests just get processed one after the other (by processed I mean after the await on process_prompt completes), but maybe a semaphore is needed.

I think this might cause issues currently since the kv_cache is shared. Can you test this out to see if it works? It doesn't work on main but maybe with your changes it works now.

@varshith15 (Contributor, Author) commented Aug 18, 2024

I've tested out a couple of different scenarios by sending requests to the service, and it works as expected.
I'm doing more testing.

@varshith15 (Contributor, Author) commented Aug 19, 2024

> Currently the requests just get processed one after the other (by processed I mean after the await on process_prompt completes), but maybe a semaphore is needed.

> I think this might cause issues currently since the kv_cache is shared. Can you test this out to see if it works? It doesn't work on main but maybe with your changes it works now.

@AlexCheema I think you are right; there's a bug when there's a request already running and a new request comes in: the older one is not able to return its result. Can you expand on what you mean by the issue being due to the kv_cache being shared?

@varshith15 (Contributor, Author) commented Aug 22, 2024

@AlexCheema could you expand on why concurrent processing doesn't work when the kv_cache is shared?

I haven't debugged it yet; I will do it over the weekend.

@AlexCheema (Contributor) commented

> @AlexCheema could you expand on why concurrent processing doesn't work when the kv_cache is shared?

> I haven't debugged it yet; I will do it over the weekend.

I pushed something for MLX that uses an LRU cache holding a pool of KV caches, which should fix it. Can you merge the latest exo and fix conflicts? Then I will take a look through the PR.
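
A hedged sketch of that idea: an LRU pool of per-request KV caches, with make_kv_cache standing in for however the inference engine actually builds a cache (this is not the actual MLX code).

```python
from collections import OrderedDict

class KVCacheLRU:
  """Keep a small pool of per-request KV caches, evicting the least recently used."""

  def __init__(self, capacity: int, make_kv_cache):
    self.capacity = capacity
    self.make_kv_cache = make_kv_cache   # factory that builds a fresh KV cache
    self.caches: OrderedDict[str, object] = OrderedDict()

  def get(self, request_id: str):
    if request_id in self.caches:
      self.caches.move_to_end(request_id)   # mark as most recently used
    else:
      if len(self.caches) >= self.capacity:
        self.caches.popitem(last=False)     # evict the least recently used cache
      self.caches[request_id] = self.make_kv_cache()
    return self.caches[request_id]
```

Usage would be something like cache = kv_lru.get(request_id) at the start of each decode step, so each request keeps its own cache as long as it stays within the pool's capacity.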

exo/helpers.py Outdated

```python
def on_next(self, callback: Callable[..., None]) -> None:
  self.observers.append(callback)

def set(self, *args: T) -> None:
  self.result = args
  self.result.append(args)
```
@varshith15 (Contributor, Author) commented Aug 24, 2024

@AlexCheema the concurrent execution issue is due to this.

@varshith15 (Contributor, Author) commented Aug 24, 2024

@AlexCheema this is done; it works as expected with online batching and concurrent requests now. Please review.

A few remaining things:

  1. Need to add semaphore functionality in the ChatGPT API matching the LRU cache size (see the sketch after this list).
  2. I've moved all the functions from taking a single request_id to taking a list of request_ids. I've updated the model inference part for MLX; please comment on the other engines.
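
A minimal sketch of item 1, assuming a KV_CACHE_SLOTS constant sized to match the LRU cache capacity; the names here are hypothetical, not the PR's actual identifiers.

```python
import asyncio

KV_CACHE_SLOTS = 4                       # assumed to match the LRU cache capacity
request_semaphore = asyncio.Semaphore(KV_CACHE_SLOTS)

async def handle_chat_request(process_prompt, prompt: str, request_id: str):
  # At most KV_CACHE_SLOTS prompts are in flight at once, so an active request's
  # KV cache is never evicted out from under it.
  async with request_semaphore:
    return await process_prompt(prompt, request_id)
```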
