(spike) Determine open source LLMs to evaluate on #717
Comments
Since SynthIA-7B is our default model, it will be included in the list of initial evaluations. We have also done our own quantization of a fine-tuned Mistral-7B, so that should be another model included in the initial evaluations.
Potential models for consideration:
To balance scope and time, the following models will be used:
GPT-4o will also be used as a point of comparison in the results.
Hi, @jalling97! Have you seen NVIDIA's distilled Llama 3.1 8B models?
Here's the paper on model distillation, if interested: https://arxiv.org/abs/2408.11796
Hey @jxtngx! I have not, I'll be sure to take a look at these. The primary issue with using Llama 3.1 is the license; based on the wording, we're still not sure whether we're able to use it. Not being an expert on the legal side of things, I'm not sure if the NVIDIA Open Model License changes any of that, but I'll be sure to experiment with these models regardless. The benchmark comparison looks promising. Thanks for sending this over!
There's also NVIDIA's own Minitron 4B and 8B. Again, though, the proprietary license may be a blocker. I've added all of the models to this collection for easy reference: https://huggingface.co/collections/jxtngx/nvidia-minitron-models-66d714aebae0e60d003a9693
@jalling97 the 24GB vRAM requirement should be lowered to 16GB (or ideally 12GB), and that change should be reflected in our documentation. If we are considering lower-end GPUs (e.g., laptop GPUs, V100s, etc.) and/or laptop CPU RAM as the only offloading target in a worst-case scenario - think a government laptop with 16GB of RAM (roughly 13GB free in an ideal situation) - then we should reduce our minimums. If the "example/demo" model for our repository is to fit on these minimum-requirement laptops and machines, then we should focus on heavily quantized, pruned, and/or lower-parameter models (down to ~2B effective parameters). Until we enable CPU + GPU offloading in vLLM (still not production ready) or another engine, we are constrained to one or the other's compute RAM. Other factors besides parameter count include architecture, context size, and engine parameters we may be able to tune to lower vRAM usage at the edge. An oversimplified example of these considerations can be seen in the ADDITIONAL CONTEXT section of this WIP PR: #854 (comment). Additional parameters for vLLM can be found here: https://docs.vllm.ai/en/v0.4.3/models/engine_args.html (pinned to v0.4.3 due to issues in 0.5.x, as described in the aforementioned WIP PR).
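For concreteness, here is a minimal sketch (my own illustration, not the project's actual config) of the kinds of vLLM engine arguments that can pull vRAM usage down toward a 12-16GB budget, assuming vLLM ~v0.4.3. The model name is a placeholder for any pre-quantized checkpoint, and the values are illustrative, not recommendations:

```python
# Sketch: running a pre-quantized 7B model with vLLM while capping vRAM usage.
# Model name and values are placeholders; tune against the actual target GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/SynthIA-7B-v2.0-AWQ",  # hypothetical AWQ-quantized checkpoint
    quantization="awq",                    # serve 4-bit weights instead of fp16
    dtype="auto",
    max_model_len=4096,                    # cap the context length to shrink the KV cache
    gpu_memory_utilization=0.85,           # leave headroom below the GPU's total vRAM
    enforce_eager=True,                    # skip CUDA graph capture to save some memory
)

outputs = llm.generate(
    ["Summarize the purpose of this evaluation spike in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

The two biggest levers in practice are the quantized weights and `max_model_len`, since the KV cache grows linearly with the allowed context.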
Good call out! When I listed the 24GB vRAM requirement, I don't think I properly conveyed that the intention was for the model to be fully "usable" on 24GB of vRAM (i.e., not hitting OOM errors when using the model's full context). Keeping that definition of success the same, I agree it makes sense to reduce the vRAM cap to 12GB or 16GB. I'll try to keep to 12GB but may increase to 16GB if a very promising model comes into play that requires it. Happy to discuss further if you'd like.
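As a rough sanity check on the "usable at full context" criterion, here is a back-of-the-envelope KV-cache estimate (my own sketch, not a project utility). The architecture numbers below assume a Mistral-7B-like model (32 layers, 8 KV heads with GQA, head dim 128, 32k context) and are illustrative only:

```python
# Estimate KV-cache memory needed to run a model at its full context length.
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2, batch: int = 1) -> float:
    """Memory for keys + values across all layers, in GiB."""
    elems = 2 * num_layers * num_kv_heads * head_dim * context_len * batch
    return elems * bytes_per_elem / (1024 ** 3)

# Mistral-7B-like: 32 layers, 8 KV heads, head_dim 128, 32k context, fp16 cache
print(f"{kv_cache_gib(32, 8, 128, 32_768):.2f} GiB")  # ~4 GiB, on top of the model weights
```

With fp16 weights for a 7B model already around 13-14 GiB, this is why an unquantized 7B at full context does not fit a 12-16GB budget, and why quantized or smaller models are the focus.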
This research spike will be closed. The three models that will initially be evaluated are:
Description
As part of the deliverables for an MVP Evals framework, we need a short list of LLMs to evaluate as part of LFAI. The chosen models should fit the following criteria (with room for potential exceptions):
Relevant Links
Inspiration can be found on the HuggingFace Open LLM Leaderboard