Inference with Transformers
We provide a command-line approach to inference using native Transformers, taking the Chinese-Mixtral-Instruct model as an example.
If you are using the full model, or have already merged the LoRA model with Mixtral-8x7B-v0.1 using `merge_mixtral_with_chinese_lora_low_mem.py`, you can perform inference directly:
```bash
python scripts/inference/inference_hf.py \
    --base_model path_to_chinese_mixtral_instruct_hf_dir \
    --with_prompt \
    --interactive
```
This script also supports using vLLM for LLM inference and serving. You need to install vLLM first:
```bash
pip install vllm
```
Then simply add the `--use_vllm` argument to the original command:
```bash
python scripts/inference/inference_hf.py \
    --base_model path_to_chinese_mixtral_instruct_hf_dir \
    --with_prompt \
    --interactive \
    --use_vllm
```
- `--base_model {base_model}`: Directory containing the Chinese-Mixtral model weights and configuration files in HF format.
- `--tokenizer_path {tokenizer_path}`: Directory containing the corresponding tokenizer. If this parameter is not provided, it defaults to the same value as `--base_model`.
- `--with_prompt`: Whether to wrap the input with the prompt template. If you are loading a Mixtral-Instruct model, be sure to enable this option!
- `--interactive`: Launch interactively for multiple single-turn question-answer sessions (note that this is not the contextual dialogue of llama.cpp).
- `--data_file {file_name}`: In non-interactive mode, read the contents of `file_name` line by line for prediction; see the example after this list.
- `--predictions_file {file_name}`: In non-interactive mode, write the prediction results in JSON format to `file_name`.
- `--only_cpu`: Use only the CPU for inference.
- `--gpus {gpu_ids}`: The GPU id(s) to use, default `0`. You can specify multiple GPUs, for instance `0,1,2`.
- `--load_in_8bit` or `--load_in_4bit`: Load the model in 8-bit or 4-bit mode; `--load_in_4bit` is recommended.
- `--use_vllm`: Use vLLM as the LLM backend for inference and serving.
- `--use_flash_attention_2`: Use Flash-Attention 2 to speed up inference. If this parameter is not specified, the code defaults to SDPA acceleration.
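For example, a minimal non-interactive (batch) run might look like this; the input and output file names below are illustrative:

```bash
# Read prompts from prompts.txt (one per line) and write the
# predictions to predictions.json. File names are placeholders.
python scripts/inference/inference_hf.py \
    --base_model path_to_chinese_mixtral_instruct_hf_dir \
    --with_prompt \
    --data_file prompts.txt \
    --predictions_file predictions.json
```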
- This script is intended only for a convenient and quick experience; it has not been optimized for fast inference.
- The Mixtral model weights are about 87 GB. Loading in 4-bit is recommended, which still requires about 26 GB of (video) memory; for interactive inference, llama.cpp is recommended.
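As a sketch of the recommended setup, an interactive 4-bit session could be launched as follows (the model directory is a placeholder):

```bash
# Quantize the model to 4-bit on the fly to reduce memory usage
# from ~87 GB to ~26 GB, per the note above.
python scripts/inference/inference_hf.py \
    --base_model path_to_chinese_mixtral_instruct_hf_dir \
    --with_prompt \
    --interactive \
    --load_in_4bit
```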