How to Run LLMs Locally: Best Tools & Optimizations

Running large language models (LLMs) locally requires a combination of hardware, software, and optimization techniques.

1. Using Pre-built Tools & UI-Based Runtimes

These tools make it easy to run LLMs without deep technical knowledge:

  • LM Studio – A simple desktop app to run local LLMs (supports GGUF models).
  • Ollama – Lightweight LLM runner with built-in model downloads (ollama run mistral); it also exposes a local REST API you can script against (see the sketch after this list).
  • LocalAI – An open-source tool that provides a simplified interface for running various models directly on local machines.
  • GPT4All – A GUI-based tool for running various LLMs locally.
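As a quick illustration of how scriptable these runtimes are, here is a minimal sketch that sends a prompt to Ollama's local REST API. It assumes Ollama is installed, a model such as mistral has already been pulled, and the server is listening on its default port 11434; response fields may vary between Ollama versions.

```python
# Minimal sketch: query a locally running Ollama server over its REST API.
# Assumes `ollama pull mistral` has been run and the server is listening on
# the default port 11434; adjust the model name and URL for your setup.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Summarize why quantization helps local LLM inference.",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])  # the generated text
```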

2. Using Python-Based Libraries

For more control, you can use Python frameworks such as Hugging Face transformers, llama-cpp-python, or vLLM:
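For example, here is a minimal sketch using llama-cpp-python to run a quantized GGUF model. The model path is hypothetical (point it at any GGUF file you have downloaded), and n_gpu_layers only takes effect if the package was built with CUDA support.

```python
# Minimal sketch: run a quantized GGUF model with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to the GPU (needs a CUDA-enabled build)
)

result = llm(
    "Explain in one sentence what a quantized model is.",
    max_tokens=128,
)
print(result["choices"][0]["text"])
```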

3. Optimizing for Performance

Since LLMs are memory-intensive, consider these optimizations:

  • Use Quantized Models (GGUF, GPTQ, AWQ, etc.) – Reduce VRAM usage while keeping output quality close to the full-precision model (see the 4-bit loading sketch after this list).
  • Use Flash Attention & LoRA Fine-Tuning – Flash Attention speeds up attention and lowers memory use; LoRA lets you fine-tune a model with far less GPU memory.
  • Enable CUDA Acceleration – Make sure your NVIDIA GPU is actually used for inference: build llama.cpp with CUDA support and offload layers with --n-gpu-layers (or -ngl).
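As one concrete example of quantization on a CUDA GPU, the sketch below loads a model in 4-bit with Hugging Face transformers and bitsandbytes. The model id is only illustrative, and it assumes the transformers, accelerate, and bitsandbytes packages are installed.

```python
# Minimal sketch: load a causal LM in 4-bit precision on a CUDA GPU using
# transformers + bitsandbytes. The model id is illustrative; use any model
# from the Hugging Face Hub that fits your VRAM budget.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative model id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the GPU (and CPU, if they do not fit)
)

inputs = tokenizer("Why do quantized models use less VRAM?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```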

4. Running Large Models Efficiently

The GPU you need depends on the model size and the optimizations you apply (a rough VRAM estimate sketch follows this list):

  • 4GB VRAM – Small models (e.g., Mistral 7B with heavy quantization like 4-bit GGUF).
  • 8GB VRAM – Mid-sized models (e.g., LLaMA 2 7B, Mistral 7B with moderate quantization).
  • 12GB VRAM – Can run 7B models at 8-bit precision or 13B models with 4-bit quantization.
  • 16GB VRAM – Good for 13B models at 8-bit precision or 30B models with heavy quantization and offloading.
  • 24GB+ VRAM (e.g., RTX 4090, A6000) – Can run 30B models comfortably and even 65B models with optimizations.
  • Multi-GPU setups (e.g., Dual RTX 3090s, A100s, H100s) – Required for 65B+ models at full precision.
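A rough rule of thumb behind these tiers: weight memory is roughly the parameter count times the bytes per weight, plus overhead for the KV cache and activations. The sketch below is only an approximation for the weights themselves.

```python
# Back-of-the-envelope VRAM estimate for model weights only (no KV cache,
# activations, or framework overhead).
def estimate_weight_vram_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

print(estimate_weight_vram_gb(7, 16))   # 7B at fp16   -> ~14.0 GB
print(estimate_weight_vram_gb(7, 4))    # 7B at 4-bit  -> ~3.5 GB
print(estimate_weight_vram_gb(13, 4))   # 13B at 4-bit -> ~6.5 GB
print(estimate_weight_vram_gb(65, 4))   # 65B at 4-bit -> ~32.5 GB
```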

For those with lower VRAM, techniques like CPU offloading, quantization (GGUF, GPTQ, AWQ), and tensor parallelism can help run larger models.
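As an illustration of CPU offloading, the sketch below uses transformers with device_map="auto" and an explicit max_memory budget so that layers that do not fit in VRAM are kept in system RAM. The memory limits and model id are placeholders to adjust for your own hardware, and it assumes the accelerate package is installed.

```python
# Minimal sketch: CPU offloading with transformers/accelerate. Layers that do
# not fit within the GPU budget below are placed in system RAM automatically.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"  # placeholder; may require license acceptance

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",                        # let accelerate decide layer placement
    max_memory={0: "10GiB", "cpu": "32GiB"},  # cap GPU 0 at 10 GiB, allow 32 GiB of RAM
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "List two trade-offs of CPU offloading for local LLM inference."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```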