batched inference #214

varshith15 · 2024-09-10T18:23:40Z

Original PR: #108

~~It was built on top of older code and needs a lot of refactoring~~ Refactored now.
Currently on pause: I need to think more broadly about distributed training on exo.


AlexCheema · 2024-09-24T18:04:04Z

Does this work as expected? If you look at mlx_parallm, they use a BatchedKVCache implementation to handle the kv cache for batches https://github.com/willccbb/mlx_parallm/blob/80b18ab49b80e6f8d82d89347ab32f44b35f8942/mlx_parallm/utils.py#L201

I'm not sure how it would work with the current implementation. It looks like one cache is used for all the requests, which is probably not what we want here.

varshith15 · 2024-09-25T05:31:18Z

@AlexCheema https://github.com/willccbb/mlx_parallm/blob/80b18ab49b80e6f8d82d89347ab32f44b35f8942/mlx_parallm/utils.py#L201 is exactly the same as what I did. The only difference is that it stores batch_size as a variable, whereas I just infer it from the input shape.
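To make the point about inferring the batch size concrete, here is a minimal sketch of a batched KV cache that reads the batch dimension off the incoming tensors instead of storing it up front. It is not the code from this PR or from mlx_parallm: the class name `ShapeInferredKVCache`, its parameters, and the use of NumPy in place of MLX arrays are all assumptions made for illustration. Each request gets its own row along the leading axis, so a single cache object still keeps the requests separate.

```python
import numpy as np


class ShapeInferredKVCache:
    """Hypothetical per-layer KV cache with a leading batch axis.

    The batch size is never passed in; it is taken from the shape of the
    first keys/values tensor. NumPy stands in for MLX arrays here.
    """

    def __init__(self, n_kv_heads, head_dim, step=256):
        self.n_kv_heads = n_kv_heads
        self.head_dim = head_dim
        self.step = step      # capacity grows in multiples of `step` tokens
        self.keys = None
        self.values = None
        self.offset = 0       # number of token positions already cached

    def update_and_fetch(self, keys, values):
        # keys, values: (batch, n_kv_heads, new_tokens, head_dim)
        batch_size, _, n_new, _ = keys.shape  # batch size inferred from input
        if self.keys is not None and self.keys.shape[0] != batch_size:
            raise ValueError("batch size changed between calls")

        needed = self.offset + n_new
        if self.keys is None or needed > self.keys.shape[2]:
            # Round the capacity up to the next multiple of `step`.
            capacity = ((needed + self.step - 1) // self.step) * self.step
            shape = (batch_size, self.n_kv_heads, capacity, self.head_dim)
            new_keys = np.zeros(shape, dtype=keys.dtype)
            new_values = np.zeros(shape, dtype=values.dtype)
            if self.keys is not None:
                # Carry over what was cached before growing.
                new_keys[:, :, :self.offset] = self.keys[:, :, :self.offset]
                new_values[:, :, :self.offset] = self.values[:, :, :self.offset]
            self.keys, self.values = new_keys, new_values

        # Write the new tokens for every sequence in the batch at once.
        self.keys[:, :, self.offset:needed] = keys
        self.values[:, :, self.offset:needed] = values
        self.offset = needed
        return self.keys[:, :, :self.offset], self.values[:, :, :self.offset]
```

A prefill call with shape `(3, 2, 5, 4)` followed by a decode step with shape `(3, 2, 1, 4)` returns keys of shape `(3, 2, 6, 4)`: three independent rows, one per request. Note that this sketch assumes all sequences in the batch share the same length (for example via padding); handling ragged prompts would need masking on top.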

Varshith and others added 5 commits September 10, 2024 23:44

batched inference

c4a7de4

cleanup

24c5ac2

Merge branch 'main' into batched_inference

5150a42

cleanup

722e42c

Merge branch 'batched_inference' of https://github.com/varshith15/exo into batched_inference

c8696cc
