llamacpp_en
This wiki walks you through the detailed steps of model quantization and local deployment using llama.cpp. Windows users may need to install a compiler toolchain such as cmake. For quick local deployment, it is recommended to use the instruction-tuned model (Chinese-Mixtral-Instruct). If possible, use the 6-bit or 8-bit quantized models for better results. Before running, ensure that:
- Your system has the `make` (built in on macOS/Linux) or `cmake` (needs to be installed on Windows) compiler tool
- It is recommended to use Python 3.10 or above to compile and run this tool
- (Optional) If you have downloaded an old copy of the repository, it is recommended to pull the latest code with `git pull` and clean it with `make clean`
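As a quick sanity check before building, you can verify the tool versions from a terminal (a minimal sketch; use whichever Python launcher your system provides):
$ make --version     # or: cmake --version on Windows
$ python3 --version  # 3.10 or above is recommended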
- Pull the latest code from the llama.cpp repository
$ git clone https://github.com/ggerganov/llama.cpp
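The remaining commands in this guide are run from inside the cloned directory:
$ cd llama.cpp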
- Compile the llama.cpp project to generate the `./main` (for inference) and `./quantize` (for quantization) binary files.
$ make
- Windows/Linux users: It is recommended to compile with BLAS (or cuBLAS if you have a GPU) to speed up prompt processing. See llama.cpp#blas-build for reference.
- macOS users: No additional steps are required. llama.cpp is optimized for ARM NEON and has BLAS enabled by default. For M-series chips, it is recommended to use Metal for GPU inference, which significantly improves speed. To do this, change the compilation command to `LLAMA_METAL=1 make`. See llama.cpp#metal-build for reference.
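For reference, the typical build invocations look roughly like the following. This is a sketch only; the exact flag names (e.g. `LLAMA_CUBLAS`) vary between llama.cpp versions, so check the build documentation linked above:
$ make                    # plain CPU build
$ make LLAMA_CUBLAS=1     # NVIDIA GPU build with cuBLAS
$ LLAMA_METAL=1 make      # Metal build for Apple M-series chips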
(💡 You can also directly download the quantized models: GGUF models)
Convert the full model weights (either `.safetensors` or `.bin` format) to GGUF FP16 format, then quantize them (here to 4-bit, q4_0):
$ python convert.py chinese-mixtral-instruct/
$ ./quantize chinese-mixtral-instruct/ggml-model-f16.gguf chinese-mixtral-instruct/ggml-model-q4_0.gguf q4_0
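If memory and disk space allow, the 6-bit or 8-bit models recommended above can be produced in the same way. The following is a sketch using the standard llama.cpp quantization type names `q6_K` and `q8_0`; the output file names are arbitrary:
$ ./quantize chinese-mixtral-instruct/ggml-model-f16.gguf chinese-mixtral-instruct/ggml-model-q6_k.gguf q6_K
$ ./quantize chinese-mixtral-instruct/ggml-model-f16.gguf chinese-mixtral-instruct/ggml-model-q8_0.gguf q8_0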
Since the Chinese-Mixtral-Instruct model released by this project uses the instruction template of Mixtral-8x7B-Instruct-v0.1, please first copy this project's `scripts/llama-cpp/chat.sh` to the root directory of llama.cpp. The content of the `chat.sh` file is shown below; the chat template and some default parameters are embedded in it and can be modified to suit your setup.
- GPU inference: if compiled with cuBLAS/Metal, specify the number of layers to offload, e.g., `-ngl 40` offloads 40 layers of model parameters to the GPU.
#!/bin/bash
# Usage: ./chat.sh <path-to-gguf-model>
./main -m $1 --color --interactive-first \
-c 4096 -t 6 --temp 0.2 --repeat_penalty 1.1 -ngl 999 \
--in-prefix ' [INST] ' --in-suffix ' [/INST]'
Grant execute permission to the script, then start chatting with the following commands.
$ chmod +x chat.sh
$ ./chat.sh ggml-model-q4_0.gguf
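For a one-off, non-interactive run, you can also call `./main` directly and wrap the prompt in the same [INST] ... [/INST] template. The following is a sketch only; the prompt text and the `-n 512` generation limit are placeholders to adjust as needed:
$ ./main -m ggml-model-q4_0.gguf -c 4096 --temp 0.2 -n 512 \
    -p ' [INST] Briefly introduce the llama.cpp project. [/INST]'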
After the `>` prompt appears, enter your prompt. Use `cmd/ctrl+c` to interrupt output, and end a line with `\` for multiline input. For help and full parameter descriptions, run `./main -h`. Some common parameters (all used in `chat.sh` above):
- `-m`: path to the GGUF model file
- `-c`: context window size
- `-t`: number of CPU threads
- `--temp`: sampling temperature
- `--repeat_penalty`: penalty applied to repeated tokens
- `-ngl`: number of model layers to offload to the GPU
- `--in-prefix` / `--in-suffix`: strings inserted before/after user input (the chat template)
For a more detailed official description, please refer to: https://github.com/ggerganov/llama.cpp/tree/master/examples/main