- Optimized configuration for the Lumo 8B Instruct model, yielding 1.5x to 3x speed gains.
- Benchmarked on an AWS g6e.2xlarge instance (NVIDIA L40S)
- Provision an NVIDIA L40S (or better) GPU instance
- An Ubuntu VM is preferred
- Install Docker, the NVIDIA CUDA drivers, and the NVIDIA Container Toolkit (see the sketch after this list)
- Install Ollama (optional; for comparison or to reproduce these tests)
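As a rough sketch, host setup on Ubuntu might look like the following. The commands mirror the official Docker, NVIDIA Container Toolkit, and Ollama install instructions; they are assumptions about your environment, not scripts shipped in this repo:

```bash
# Install Docker via the official convenience script
curl -fsSL https://get.docker.com | sh

# Verify the NVIDIA driver is visible before continuing
nvidia-smi

# Install the NVIDIA Container Toolkit (per NVIDIA's Ubuntu instructions)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Optional: install Ollama for comparison runs
curl -fsSL https://ollama.com/install.sh | sh
```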
```bash
cd src

# Build the vLLM Docker image and start it via the helper script
./run.sh

# OR, manual mode: build the image yourself
docker build . -t vllm-gguf

# Run the container
docker run --gpus all \
    --shm-size 16g \
    -p 8000:8000 \
    vllm-gguf
```
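Assuming the container exposes vLLM's OpenAI-compatible server on port 8000 (the `-p 8000:8000` mapping above suggests it does), a quick smoke test might look like this; the model id in the request is whatever `/v1/models` reports, not a value defined by this repo:

```bash
# List the models the server is serving
curl -s http://localhost:8000/v1/models

# Send a single completion request (replace MODEL with the id
# reported by /v1/models above)
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MODEL", "prompt": "Hello, world", "max_tokens": 32}'
```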
- Benchmark:

```bash
# From the repository root, set up a Python environment for the benchmark client
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Run the benchmark against the local vLLM server
python src/test_model.py
```
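For a quick tokens-per-second figure without the Python client, a shell-only measurement could look like the sketch below. It assumes `jq` and `bc` are installed, the server is on port 8000, and `MODEL` is the id from `/v1/models`; it relies on vLLM's completion responses including a `usage.completion_tokens` field:

```bash
# Rough single-request throughput estimate
START=$(date +%s.%N)
TOKENS=$(curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MODEL", "prompt": "Explain quantization in one paragraph.", "max_tokens": 256}' \
  | jq '.usage.completion_tokens')
END=$(date +%s.%N)
echo "scale=1; $TOKENS / ($END - $START)" | bc | xargs -I{} echo "{} tokens/s"
```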
- More optimizations to the Docker image
- Publish the image on Docker Hub
- Serverless configuration (on RunPod, Koyeb, etc.)