llamacpp_en
This wiki walks you through the detailed steps of model quantization and local deployment using llama.cpp. Windows users may need to install a compiler toolchain such as cmake. For quick local deployment, it is recommended to use the instruction-tuned model (Chinese-Mixtral-Instruct). If possible, use the 6-bit or 8-bit quantized models for better results. Before running, ensure that:
- Your system has the `make` (built in on macOS/Linux) or `cmake` (needs to be installed on Windows) compiler tool
- It is recommended to use Python 3.10 or above to compile and run this tool
- (Optional) If you have downloaded an old copy of the repository, it is recommended to pull the latest code with `git pull` and clean it with `make clean`
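As a quick sanity check before building, you can verify the tool versions from a terminal (a minimal sketch; use whichever Python launcher your system provides):
$ make --version     # or: cmake --version on Windows
$ python3 --version  # 3.10 or above is recommended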
- Pull the latest code from the llama.cpp repository
$ git clone https://github.com/ggerganov/llama.cpp
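The remaining commands in this guide are run from inside the cloned directory:
$ cd llama.cpp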
- Compile the llama.cpp project to generate the `./main` (for inference) and `./quantize` (for quantization) binary files.
$ make
- Windows/Linux users: It is recommended to compile with BLAS (or cuBLAS if you have a GPU) to speed up prompt processing. See llama.cpp#blas-build for reference.
- macOS users: No additional steps are required. llama.cpp is optimized for ARM NEON and has BLAS enabled by default. For M-series chips, it is recommended to use Metal for GPU inference, which significantly improves speed. To do this, change the compilation command to `LLAMA_METAL=1 make`. See llama.cpp#metal-build for reference.
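For reference, the typical build invocations look roughly like the following. This is a sketch only; the exact flag names (e.g. `LLAMA_CUBLAS`) vary between llama.cpp versions, so check the build documentation linked above:
$ make                    # plain CPU build
$ make LLAMA_CUBLAS=1     # NVIDIA GPU build with cuBLAS
$ LLAMA_METAL=1 make      # Metal build for Apple M-series chips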
(💡 You can also directly download the quantized models: GGUF models)
Convert the full model weights (either `.safetensors` or `.bin` format) to GGUF FP16 format, then quantize them (here to 4-bit, q4_0):
$ python convert.py chinese-mixtral-instruct/
$ ./quantize chinese-mixtral-instruct/ggml-model-f16.gguf chinese-mixtral-instruct/ggml-model-q4_0.gguf q4_0
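If memory and disk space allow, the 6-bit or 8-bit models recommended above can be produced in the same way. The following is a sketch using the standard llama.cpp quantization type names `q6_K` and `q8_0`; the output file names are arbitrary:
$ ./quantize chinese-mixtral-instruct/ggml-model-f16.gguf chinese-mixtral-instruct/ggml-model-q6_k.gguf q6_K
$ ./quantize chinese-mixtral-instruct/ggml-model-f16.gguf chinese-mixtral-instruct/ggml-model-q8_0.gguf q8_0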
Since the Chinese-Mixtral-Instruct model released by this project uses the instruction template of Mixtral-8x7B-Instruct-v0.1, please first copy this project's `scripts/llama-cpp/chat.sh` to the root directory of llama.cpp. The content of the `chat.sh` file is shown below; the chat template and some default parameters are embedded in it and can be modified to suit your setup.
- GPU inference: if compiled with cuBLAS/Metal, specify the number of layers to offload, e.g., `-ngl 40` offloads 40 layers of model parameters to the GPU.
#!/bin/bash
# Usage: ./chat.sh <path-to-gguf-model>
./main -m $1 --color --interactive-first \
-c 4096 -t 6 --temp 0.2 --repeat_penalty 1.1 -ngl 999 \
--in-prefix ' [INST] ' --in-suffix ' [/INST]'
Grant execute permission to the script, then start chatting with the following commands.
$ chmod +x chat.sh
$ ./chat.sh ggml-model-q4_0.gguf
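For a one-off, non-interactive run, you can also call `./main` directly and wrap the prompt in the same [INST] ... [/INST] template. The following is a sketch only; the prompt text and the `-n 512` generation limit are placeholders to adjust as needed:
$ ./main -m ggml-model-q4_0.gguf -c 4096 --temp 0.2 -n 512 \
    -p ' [INST] Briefly introduce the llama.cpp project. [/INST]'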
After the `>` prompt appears, enter your prompt. Use `cmd/ctrl+c` to interrupt output, and end a line with `\` for multiline input. For help and full parameter descriptions, run `./main -h`. Some common parameters (all used in `chat.sh` above):
- `-m`: path to the GGUF model file
- `-c`: context window size
- `-t`: number of CPU threads
- `--temp`: sampling temperature
- `--repeat_penalty`: penalty applied to repeated tokens
- `-ngl`: number of model layers to offload to the GPU
- `--in-prefix` / `--in-suffix`: strings inserted before/after user input (the chat template)
For a more detailed official description, please refer to: https://github.com/ggerganov/llama.cpp/tree/master/examples/main