
(spike) Determine open source LLMs to evaluate on #717

Closed
jalling97 opened this issue Jul 2, 2024 · 9 comments

@jalling97
Contributor

jalling97 commented Jul 2, 2024

Description

As part of the deliverables for an MVP Evals framework, we need a short list of LLMs to evaluate as part of LFAI. The models chosen should fit the following criteria (with room for potential exceptions):

  • Fewer than 5 total models (an excess of evals at the outset is information overload, takes extra time, and may not be valuable out of the gate)
  • Models should be foundational (avoid hyper-specific fine-tuned models)
  • Models should be Apache-2.0 licensed (or more permissive)
  • Models should fit in 12-16 GB of VRAM (see the rough sizing sketch below)
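For a rough sense of the last criterion, weight memory scales with parameter count times bytes per parameter, plus KV-cache and activation overhead. A minimal back-of-the-envelope sketch (the 1.2x overhead factor is an assumption for illustration, not a measurement):

```python
# Rough VRAM estimate: weights plus a fudge factor for KV cache and activations.
# The 1.2x overhead factor is an assumption for illustration, not a measurement.
def estimate_vram_gb(params_billions: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    weights_gb = params_billions * bytes_per_param  # ~N GB per billion params at N bytes/param
    return weights_gb * overhead

# A 7B model in fp16 (2 bytes/param) vs. 4-bit quantized (~0.5 bytes/param)
print(f"7B fp16 : ~{estimate_vram_gb(7, 2.0):.1f} GB")   # ~16.8 GB -- over a 12-16 GB budget
print(f"7B 4-bit: ~{estimate_vram_gb(7, 0.5):.1f} GB")   # ~4.2 GB  -- fits comfortably
```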

Relevant Links

Inspiration can be found on the HuggingFace Open LLM Leaderboard

@jalling97
Contributor Author

jalling97 commented Aug 20, 2024

Since SynthIA-7B is our default model, it will be included in the list of initial evaluations.

We have also produced our own quantization of a fine-tuned Mistral-7B, so it should be another model used in the initial evaluations.
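For reference, a minimal sketch of loading a 7B model with 4-bit quantization via Hugging Face transformers and bitsandbytes; the checkpoint ID is a placeholder, not necessarily the exact fine-tuned Mistral-7B we quantized:

```python
# Minimal sketch: load a 7B model with 4-bit (NF4) quantization via bitsandbytes.
# The model ID is a placeholder; swap in the actual fine-tuned Mistral-7B checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # requires accelerate; places layers on the available GPU
)
```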

@jalling97
Contributor Author

Potential models for consideration:

@jalling97
Contributor Author

jalling97 commented Aug 20, 2024

To balance scope and time, the following models will be used:

  • SynthIA-7B (default model for now)
  • Hermes2 (Defense Unicorns quantization)
  • Llama-3.1-8B (pending license approval)
    • If not available, Phi-3-small (8k context)

GPT-4o will also be used as a point of comparison in the results.
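Since the local models are served behind an OpenAI-compatible API and GPT-4o is available through the OpenAI API, the comparison can loop the same prompts over each endpoint. A minimal sketch, assuming OpenAI-compatible endpoints; base URLs, model names, and the API key are placeholders:

```python
# Minimal sketch of a side-by-side comparison over OpenAI-compatible endpoints.
# Base URLs, model names, and the API key are placeholders, not deployment values.
from openai import OpenAI

endpoints = {
    "synthia-7b": ("http://localhost:8080/openai/v1", "synthia-7b"),
    "hermes-2":   ("http://localhost:8081/openai/v1", "hermes-2"),
    "gpt-4o":     ("https://api.openai.com/v1", "gpt-4o"),
}

prompt = "Summarize the trade-offs of quantizing a 7B-parameter LLM."

for name, (base_url, model) in endpoints.items():
    client = OpenAI(base_url=base_url, api_key="sk-placeholder")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    print(f"--- {name} ---\n{response.choices[0].message.content}\n")
```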

@jxtngx

jxtngx commented Sep 3, 2024

Hi, @jalling97! Have you seen the Minitron models NVIDIA distilled from Llama 3.1 8B?

  1. https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base
  2. https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Depth-Base

Here's the paper on model distillation, if interested: https://arxiv.org/abs/2408.11796

@jalling97
Contributor Author

Hey @jxtngx! I have not; I'll be sure to take a look at these. The primary issue with using Llama 3.1 is the license: based on the wording, we're still not sure whether we can use it. Not being an expert on the legal side, I don't know whether the NVIDIA Open Model License changes that, but I'll experiment with these models regardless. The benchmark comparison looks promising. Thanks for sending this over!

@jxtngx

jxtngx commented Sep 3, 2024

There are also NVIDIA's own Minitron 4B and 8B. Again, though, the proprietary license may be a blocker.

I've added all of the models to this collection for easy reference:

https://huggingface.co/collections/jxtngx/nvidia-minitron-models-66d714aebae0e60d003a9693

@justinthelaw
Contributor

justinthelaw commented Sep 3, 2024

@jalling97 the 24 GB VRAM requirement should be lowered to 16 GB (or ideally 12 GB), and the change should be reflected in our documentation. If we are considering lower-end GPUs (e.g., laptop GPUs, V100s) and/or laptop CPU RAM - think a government laptop with 16 GB of RAM (about 13 GB free in an ideal situation) - as the only offloading target in a worst-case scenario, then we should reduce our minimums.

If the "example/demo" model for our repository is to fit on minimum-requirement laptops and machines, then we should focus on heavily quantized, pruned, and/or lower-parameter models (down to ~2B effective parameters). Until we enable CPU + GPU offloading in vLLM (still not production-ready) or another engine, we are constrained to one or the other's memory.

Other factors besides parameter count include architecture, context size, and engine parameters we may be able to tune to lower VRAM usage at the edge. An oversimplified example of these considerations can be seen in the ADDITIONAL CONTEXT section of this WIP PR: #854 (comment).

Additional parameters for vLLM can be found here: https://docs.vllm.ai/en/v0.4.3/models/engine_args.html. We're pinned to v0.4.3 due to issues in 0.5.x, as described in the WIP PR above.
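For reference, a minimal sketch of the vLLM engine parameters most relevant to trimming VRAM at the edge; the checkpoint and values are illustrative, not recommendations:

```python
# Minimal sketch of vLLM engine parameters that trade context/throughput for VRAM.
# Checkpoint and values are illustrative only; tune per model and per target GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder pre-quantized checkpoint
    quantization="awq",           # quantized weights shrink the weight footprint
    dtype="half",                 # fp16 compute
    max_model_len=4096,           # cap context length to bound KV-cache growth
    gpu_memory_utilization=0.85,  # leave headroom below the 12-16 GB ceiling
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```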

@jalling97
Contributor Author

Good callout! When I listed the 24 GB VRAM requirement, I don't think I properly conveyed that the intention was for the model to be fully "usable" on 24 GB of VRAM (i.e., not hitting OOM errors when using the model's full context). Keeping that definition of success the same, I agree it makes sense to reduce the VRAM cap to 12 or 16 GB. I'll aim for 12 GB but may go up to 16 GB if a very promising model requires it. Happy to discuss further if you'd like.
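One way to make that "usable at full context" bar concrete is a smoke test that pushes a near-max-length prompt through the served model and checks that generation completes. A minimal sketch, assuming an OpenAI-compatible endpoint; the URL, model name, and context length are placeholders:

```python
# Minimal smoke test: fill most of the context window and confirm generation
# completes without the server OOM-ing. Endpoint and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/openai/v1", api_key="sk-placeholder")
context_limit_tokens = 8192  # assumed context length for the model under test
filler = "lorem ipsum " * (context_limit_tokens // 4)  # crude approximation of a near-full context

try:
    response = client.chat.completions.create(
        model="synthia-7b",  # placeholder model name
        messages=[{"role": "user", "content": filler + "\nSummarize the above."}],
        max_tokens=128,
    )
    print("OK:", response.choices[0].message.content[:80])
except Exception as exc:
    print("Full-context request failed (possible OOM):", exc)
```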

@jalling97
Contributor Author

jalling97 commented Sep 10, 2024

This research spike will be closed. The three models that will initially be evaluated are:

  • SynthIA-7B (default model for now)
  • Hermes2 (Defense Unicorns quantization)
  • Llama-3.1-8B

GPT-4o will also be used as a point of comparison in the results.

@jalling97 jalling97 changed the title (spike) Determine Open Source LLMs for Initial Evaluations (spike) Determine open source LLMs to evaluate on Sep 10, 2024