
Inference with transformers

We provide a command-line approach to inference using native Transformers, taking the Chinese-Mixtral-Instruct model as an example.

Using native Transformers

If you are using the full model, or have already merged the LoRA model with Mixtral-8x7B-v0.1 using merge_mixtral_with_chinese_lora_low_mem.py, you can perform inference directly.

python scripts/inference/inference_hf.py \
    --base_model path_to_chinese_mixtral_instruct_hf_dir \
    --with_prompt \
    --interactive
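
For reference, the sketch below shows roughly what such a script does with native Transformers: load the merged HF-format model, build an instruct-style prompt, and generate. It is illustrative only, assumes the tokenizer ships a chat template, and uses a placeholder model path; the actual inference_hf.py handles prompting, quantization, and device placement itself.

# Illustrative sketch only (not the script's actual implementation):
# load a merged HF-format model and generate one response with native Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path_to_chinese_mixtral_instruct_hf_dir"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights; adjust to your hardware
    device_map="auto",           # shard layers across available GPUs
)

# Assumes the tokenizer carries a chat template (the instruct prompt format).
messages = [{"role": "user", "content": "What is machine learning?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))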

Accelerate Inference with vLLM

This method also supports using vLLM for LLM inference and serving. You need to install vLLM first:

pip install vllm

Simply add the --use_vllm argument to the original command line.

python scripts/inference/inference_hf.py \
    --base_model path_to_chinese_mixtral_instruct_hf_dir \
    --with_prompt \
    --interactive \
    --use_vllm
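
For reference, --use_vllm only switches the backend; the snippet below is a rough equivalent using vLLM's offline Python API directly. The model path is a placeholder, tensor_parallel_size assumes two GPUs, and the [INST] prompt format is an assumption about the instruct template.

# Illustrative sketch of offline inference with vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path_to_chinese_mixtral_instruct_hf_dir",  # placeholder path
    tensor_parallel_size=2,                           # assumption: 2 GPUs available
)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# Assumption: the model follows the Mixtral-Instruct [INST] ... [/INST] prompt format.
prompts = ["[INST] What is machine learning? [/INST]"]
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)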

Parameter Description

  • --base_model {base_model}: Directory containing the Chinese-Mixtral model weights and configuration files in HF format.
  • --tokenizer_path {tokenizer_path}: Directory containing the corresponding tokenizer. If this parameter is not provided, its default value is the same as --base_model.
  • --with_prompt: Whether to merge the input with the prompt template. If you are loading a Mixtral-Instruct model, be sure to enable this option!
  • --interactive: Launch interactively for multiple single-round question-answer sessions (this is not the contextual dialogue in llama.cpp).
  • --data_file {file_name}: In non-interactive mode, read the content of file_name line by line for prediction.
  • --predictions_file {file_name}: In non-interactive mode, write the predicted results in JSON format to file_name.
  • --only_cpu: Only use CPU for inference.
  • --gpus {gpu_ids}: The GPU ID(s) to use, default is 0. You can specify multiple GPUs, for instance 0,1,2.
  • --load_in_8bit or --load_in_4bit: Load the model in 8-bit or 4-bit mode. Using --load_in_4bit is recommended (see the sketch after this list).
  • --use_vllm: Use vLLM as the LLM backend for inference and serving.
  • --use_flash_attention_2: Use Flash-Attention 2 to speed up inference. If this parameter is not specified, the code defaults to SDPA acceleration.
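
As a rough guide to what --load_in_4bit and --use_flash_attention_2 map to in Transformers, the sketch below loads the model with a 4-bit quantization config and Flash-Attention 2. The exact arguments used by inference_hf.py may differ, and the compute dtype is an assumption.

# Rough sketch of 4-bit loading plus Flash-Attention 2 in Transformers.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bf16 compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    "path_to_chinese_mixtral_instruct_hf_dir",  # placeholder path
    quantization_config=quant_config,
    attn_implementation="flash_attention_2",  # or "sdpa" (the default acceleration path)
    device_map="auto",
)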

Note

  • This script is intended only as a convenient, quick way to try the model; it has not been optimized for fast inference.
  • The Mixtral model weights take about 87G, so loading in 4-bit is recommended, which still requires about 26G of GPU memory. For interactive inference, llama.cpp is recommended.