Inference with Transformers
We provide a command-line approach to inference using native Transformers, taking the Chinese-Mixtral-Instruct model as an example.
If you are using the full model, or have already merged the LoRA model with Mixtral-8x7B-v0.1 using `merge_mixtral_with_chinese_lora_low_mem.py`, you can perform inference directly:
```bash
python scripts/inference/inference_hf.py \
    --base_model path_to_chinese_mixtral_instruct_hf_dir \
    --with_prompt \
    --interactive
```
This script also supports using vLLM for LLM inference and serving. You need to install vLLM first:
```bash
pip install vllm
```
Then simply add the `--use_vllm` argument to the original command:
```bash
python scripts/inference/inference_hf.py \
    --base_model path_to_chinese_mixtral_instruct_hf_dir \
    --with_prompt \
    --interactive \
    --use_vllm
```
- `--base_model {base_model}`: Directory containing the Chinese-Mixtral model weights and configuration files in HF format.
- `--tokenizer_path {tokenizer_path}`: Directory containing the corresponding tokenizer. If this parameter is not provided, it defaults to the same value as `--base_model`.
- `--with_prompt`: Whether to wrap the input with the prompt template. If you are loading a Mixtral-Instruct model, be sure to enable this option!
- `--interactive`: Launch interactively for multiple single-turn question-answer sessions (note that this is not the contextual dialogue of llama.cpp).
- `--data_file {file_name}`: In non-interactive mode, read the contents of `file_name` line by line for prediction; see the example after this list.
- `--predictions_file {file_name}`: In non-interactive mode, write the prediction results in JSON format to `file_name`.
- `--only_cpu`: Use only the CPU for inference.
- `--gpus {gpu_ids}`: The GPU id(s) to use, default `0`. You can specify multiple GPUs, for instance `0,1,2`.
- `--load_in_8bit` or `--load_in_4bit`: Load the model in 8-bit or 4-bit mode; `--load_in_4bit` is recommended.
- `--use_vllm`: Use vLLM as the LLM backend for inference and serving.
- `--use_flash_attention_2`: Use Flash-Attention 2 to speed up inference. If this parameter is not specified, the code defaults to SDPA acceleration.
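For example, a minimal non-interactive (batch) run might look like this; the input and output file names below are illustrative:

```bash
# Read prompts from prompts.txt (one per line) and write the
# predictions to predictions.json. File names are placeholders.
python scripts/inference/inference_hf.py \
    --base_model path_to_chinese_mixtral_instruct_hf_dir \
    --with_prompt \
    --data_file prompts.txt \
    --predictions_file predictions.json
```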
- This script is intended only for a convenient and quick experience; it has not been optimized for fast inference.
- The Mixtral model weights are about 87 GB. Loading in 4-bit is recommended, which still requires about 26 GB of (video) memory; for interactive inference, llama.cpp is recommended.
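As a sketch of the recommended setup, an interactive 4-bit session could be launched as follows (the model directory is a placeholder):

```bash
# Quantize the model to 4-bit on the fly to reduce memory usage
# from ~87 GB to ~26 GB, per the note above.
python scripts/inference/inference_hf.py \
    --base_model path_to_chinese_mixtral_instruct_hf_dir \
    --with_prompt \
    --interactive \
    --load_in_4bit
```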