[Major] conversational prompting
huyiwen committed May 24, 2024
1 parent 43b0689 commit 7e6ea5c
Showing 13 changed files with 872 additions and 342 deletions.
31 changes: 18 additions & 13 deletions README.md
@@ -20,11 +20,11 @@ Training

Utilization

-- **Comprehensive Evaluation:** We support 51 commonly used datasets.
-- **In-Context Learning:** We support various ICL strategies, including `KATE`, `GlobalE`, and `APE`.
-- **Chain-of-Thought:** For some datasets, we support three types of CoT evaluation: `base`, `least-to-most`, and `pal`.
-- **Evaluation Methods:** We currently support three evaluation methods for multiple-choice or generation questions.
-- **Prefix Caching:** By caching the `past_key_value` of the prefix, we can speed up local inference by up to 6x.
+- **Comprehensive Evaluation:** 56+ commonly used datasets and benchmarks for evaluating LLMs.
+- **Evaluation Methods:** Accurately reproduce results from the original papers of OpenAI, LLaMA, Mistral, and other models.
+- **In-Context Learning:** We support various ICL strategies, including [`KATE`](https://aclanthology.org/2022.deelio-1.10/), [`GlobalE`](https://aclanthology.org/2022.acl-long.556/), and [`APE`](https://arxiv.org/abs/2211.01910).
+- **Chain-of-Thought:** For some datasets, we support three types of CoT evaluation: `base`, [`least-to-most`](https://arxiv.org/abs/2205.10625), and [`pal`](https://arxiv.org/abs/2211.10435).
+- **Prefix Caching:** By managing the KV cache of prefixes, we can speed up local inference by up to 6x (see the sketch below).
- **vLLM and Flash Attention Support:** We also support [`vLLM`](https://github.com/vllm-project/vllm) and [`Flash Attention`](https://github.com/Dao-AILab/flash-attention) for efficient inference.
- **Quantization:** BitsAndBytes and GPTQ quantization are supported.
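
A minimal sketch of what prefix caching buys you, using the HuggingFace `transformers` API (illustrative only; the model name and prompts are placeholders, and LLMBox's actual implementation differs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Encode the shared few-shot prefix once and keep its key/value states.
prefix_ids = tokenizer("Q: 2+2=?\nA: 4\n\n", return_tensors="pt").input_ids
with torch.no_grad():
    prefix_cache = model(prefix_ids, use_cache=True).past_key_values

# Each instance then only pays for encoding its own short suffix. A real
# implementation would copy the cache per instance instead of reusing it.
suffix_ids = tokenizer("Q: 7+6=?\nA:", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(suffix_ids, past_key_values=prefix_cache, use_cache=True).logits
```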

@@ -85,40 +85,43 @@ Alternatively, you can use the following preset bash scripts to train your model

### Merging Tokenizer

-If you want to pre-train your models on corpora with languages or tokens not well supported by the original language models (e.g., LLaMA), we provide a tokenizer-merging function to expand the vocabulary based on the corpora by using [sentencepiece](https://github.com/google/sentencepiece). You can check [merge_tokenizer.py](training/merge_tokenizer.py) for detailed information. Please follow the guide in [Pre-train](training/README.md##2-continual-pre-training-with-your-own-corpora).
+If you want to pre-train your models on corpora with languages or tokens not well supported by the original language models (e.g., LLaMA), we provide a tokenizer-merging function to expand the vocabulary based on the corpora by using [sentencepiece](https://github.com/google/sentencepiece). You can check [merge_tokenizer.py](training/merge_tokenizer.py) for detailed information. Please follow the guide in [Pre-train](https://github.com/RUCAIBox/LLMBox/tree/main/training#2-continual-pre-training-with-your-own-corpora).

```bash
bash bash/run_7b_pt.sh
```
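
Under the hood, vocabulary expansion with sentencepiece amounts to appending pieces from a corpus-trained model that the base tokenizer lacks. A hedged sketch (the function name and paths are hypothetical; see [merge_tokenizer.py](training/merge_tokenizer.py) for the real logic):

```python
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

def merge_tokenizers(base_path: str, new_path: str, out_path: str) -> None:
    """Append pieces from `new_path` that are missing from `base_path`."""
    base, new = sp_pb2.ModelProto(), sp_pb2.ModelProto()
    with open(base_path, "rb") as f:
        base.ParseFromString(f.read())
    with open(new_path, "rb") as f:
        new.ParseFromString(f.read())

    existing = {p.piece for p in base.pieces}
    for p in new.pieces:
        if p.piece not in existing:  # append only unseen tokens
            piece = sp_pb2.ModelProto.SentencePiece()
            piece.piece, piece.score = p.piece, 0.0
            base.pieces.append(piece)

    with open(out_path, "wb") as f:
        f.write(base.SerializeToString())
```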

### Merging Datasets

-If you want to train your models with a mix of multiple datasets, you can pass a list of dataset files or names to LLMBox. LLMBox will convert each file or name into a PTDataset or SFTDataset and merge them together to construct a combined dataset. You can also set the merging ratio of each dataset by passing a list of floats to LLMBox. Please follow the guide in [Merge Dataset](training/README.md##3-merging-different-datasets-with-designated-ratios-for-training).
+If you want to train your models with a mix of multiple datasets, you can pass a list of dataset files or names to LLMBox. LLMBox will convert each file or name into a PTDataset or SFTDataset and merge them together to construct a combined dataset. You can also set the merging ratio of each dataset by passing a list of floats to LLMBox. Please follow the guide in [Merge Dataset](https://github.com/RUCAIBox/LLMBox/tree/main/training#3-merging-different-datasets-with-designated-ratios-for-training).

```bash
bash bash/run_7b_hybrid.sh
```
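
Conceptually, ratio-based merging is just weighted sampling across the component datasets. A toy sketch (the function name and signature are hypothetical, not the training module's API):

```python
import random

def merge_datasets(datasets, ratios, num_samples, seed=42):
    """Draw num_samples examples, picking dataset i with probability ratios[i]."""
    rng = random.Random(seed)
    picks = rng.choices(range(len(datasets)), weights=ratios, k=num_samples)
    return [rng.choice(datasets[i]) for i in picks]

sft_data = ["sft example 1", "sft example 2"]
pt_data = ["pretrain text 1", "pretrain text 2", "pretrain text 3"]
mixed = merge_datasets([sft_data, pt_data], ratios=[0.7, 0.3], num_samples=5)
```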

### Self-Instruct and Evol-Instruct

-Since manually creating high-quality instruction data to train the model is very time-consuming and labor-intensive, Self-Instruct and Evol-Instruct were proposed to create large amounts of instruction data with varying levels of complexity using an LLM instead of humans. LLMBox supports both Self-Instruct and Evol-Instruct to augment or enhance the input data files. Please follow the guide in [Self-Instruct and Evol-Instruct](training/README.md#8-self-instruct-and-evol-instruct-for-generation-instructions).
+Since manually creating high-quality instruction data to train the model is very time-consuming and labor-intensive, Self-Instruct and Evol-Instruct were proposed to create large amounts of instruction data with varying levels of complexity using an LLM instead of humans. LLMBox supports both Self-Instruct and Evol-Instruct to augment or enhance the input data files. Please follow the guide in [Self-Instruct and Evol-Instruct](https://github.com/RUCAIBox/LLMBox/tree/main/training#8-self-instruct-and-evol-instruct-for-generation-instructions).

```bash
python self_instruct/self_instruct.py --seed_tasks_path=seed_tasks.jsonl
```

-For more details, view the [training](./training/README.md) documentation.
+For more details, view the [training](https://github.com/RUCAIBox/LLMBox/tree/main/training) documentation.

## Utilization

-We provide broad support for Huggingface models, OpenAI, Anthropic, QWen, and more models for further utilization. Currently a total of 51 commonly used datasets are supported, including: `HellaSwag`, `MMLU`, `GSM8K`, `AGIEval`, `CEval`, and `CMMLU`. For a full list of supported models and datasets, view the [utilization](./utilization/README.md) documentation.
+We provide broad support for Huggingface models (e.g. `LLaMA-3`, `Mistral`), OpenAI, Anthropic, QWen, and other OpenAI-compatible models for further utilization.
+
+Currently a total of 56+ commonly used datasets are supported, including `HellaSwag`, `MMLU`, `GSM8K`, `GPQA`, `AGIEval`, `CEval`, and `CMMLU`. For a full list of supported models and datasets, view the [utilization](https://github.com/RUCAIBox/LLMBox/tree/main/utilization) documentation.

```bash
-CUDA_VISIBLE_DEVICES=0 python inference.py \
+python inference.py \
    -m llama-2-7b-hf \
    -d mmlu agieval:[English] \
    --model_type instruction \
    --num_shots 5 \
+    --cuda 0 \
    --ranking_type ppl_no_option
```

@@ -243,7 +246,7 @@ python inference.py -m model -d dataset --kate # --globale or --ape
python inference.py -m model -d dataset --cot least_to_most # --base or --pal
```

-For more detailed instructions on model utilization, view the [utilization](./utilization/README.md) documentation.
+For more detailed instructions on model utilization, view the [utilization](https://github.com/RUCAIBox/LLMBox/tree/main/utilization) documentation.

<!-- For a full list of evaluation results, view our paper. -->

@@ -255,12 +258,14 @@ We welcome all contributions from bug fixes to new features and extensions.

We expect all contributions to be discussed in the issue tracker and to go through PRs.

+You can follow [model customization](https://github.com/RUCAIBox/LLMBox/tree/main/utilization#customize-model) and [dataset customization](https://github.com/RUCAIBox/LLMBox/tree/main/utilization#customize-dataset) to add a new model provider or dataset.

Make sure to format your code with `yapf --style .style.cfg` and `isort` before submitting a PR.


## The Team

-LLMBox is developed and maintained by [AI Box](http://aibox.ruc.edu.cn/).
+LLMBox is developed and maintained by [AI Box](http://aibox.ruc.edu.cn/). See more details in the [change log](https://github.com/RUCAIBox/LLMBox/tree/main/utilization#change-log).

## License

32 changes: 28 additions & 4 deletions utilization/README.md
@@ -9,8 +9,10 @@
- [Evaluation Arguments](#evaluation-arguments)
- [Supported Models](#supported-models)
- [Customize Model](#customize-model)
+- [Customize Chat Template](#customize-chat-template)
- [Supported Datasets](#supported-datasets)
- [Customize Dataset](#customize-dataset)
+- [Change Log](#change-log)

## Usage

@@ -123,7 +125,7 @@ Generation arguments and quantization options:
--system_prompt SYSTEM_PROMPT, -sys SYSTEM_PROMPT
The system prompt for chat-based models
--chat_template CHAT_TEMPLATE
-The chat template for huggingface chat-based models
+The chat template for local chat-based models. Supports a model-default chat template (choose from 'base', 'llama2', 'chatml', 'zephyr', 'phi3', 'llama3', ...) or a standard HuggingFace tokenizers chat template
--bnb_config BNB_CONFIG
JSON string for BitsAndBytesConfig parameters.
--load_in_8bit [LOAD_IN_8BIT]
@@ -169,9 +171,8 @@ You can evaluate datasets sequentially in a single run when they require similar
--example_set EXAMPLE_SET
The set name for demonstration, supporting slice,
e.g., train, dev, train[:10] (default: None)
---instance_format INSTANCE_FORMAT, -fmt INSTANCE_FORMAT
-The format to format the `source` and `target` for
-each instance (default: {source}{target})
+--instruction INSTRUCTION
+The template used to format the instruction for each instance. Either f-string or jinja2 format is supported, e.g., 'Answer the following question: {question}\nAnswer:'
--num_shots NUM_SHOTS, -shots NUM_SHOTS
The few-shot number for demonstration (default: 0)
--max_example_tokens MAX_EXAMPLE_TOKENS
@@ -385,6 +386,21 @@ class NewModel(Model):

Then, register your model in the [`load`](model/load.py) file.

## Customize Chat Template

Chat templates are used to format conversational messages into text input for local chat-based models.

```bash
python inference.py -m Meta-Llama-3-8B-Instruct -d gsm8k --model_type chat --chat_template llama3 -shots 8 -sys "You are a helpful assistant."
```

You don't need to specify a chat template for hosted models.

```bash
python inference.py -m gpt-3.5-turbo -d gsm8k --model_type chat -shots 8 -sys "You are a helpful assistant."
```

You can customize the [chat template](https://github.com/RUCAIBox/LLMBox/blob/main/utilization/chat_templates.py) for local chat-based models. We provide a set of chat templates for different models, and you can specify a jinja2 chat template with the `--chat_template` argument. It works in the same way as chat templating in HuggingFace [tokenizers](https://huggingface.co/docs/transformers/main/en/chat_templating).
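
For example, a minimal ChatML-style template in that convention could look like the following (a sketch, not one of the shipped templates):

```python
from jinja2 import Template

# Sketch of a ChatML-style jinja2 chat template in the tokenizers convention.
chatml = Template(
    "{% for m in messages %}<|im_start|>{{ m['role'] }}\n"
    "{{ m['content'] }}<|im_end|>\n{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)
print(chatml.render(
    messages=[{"role": "user", "content": "Hi!"}],
    add_generation_prompt=True,
))
```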


## Supported Datasets
@@ -1053,3 +1069,11 @@ def format_instance(self, instance):
To evaluate a pre-trained model that lacks instruction-following capabilities, you can provide an instruction explicitly by assigning a completion-style instruction as follows: `instruction = "{question}"`.

See [`Dataset`](dataset/dataset.py) for more details.
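
A hypothetical sketch of such a completion-style dataset (the stand-in base class below only mirrors the attribute this section mentions; the real one lives in [`dataset/dataset.py`](dataset/dataset.py)):

```python
class Dataset:
    """Stand-in for LLMBox's actual Dataset base class (hypothetical)."""
    instruction: str = ""

class MyCompletionDataset(Dataset):
    instruction = "{question}"  # bare completion prompt, no task wrapper

    def format_instance(self, instance: dict) -> dict:
        # Map raw fields to the names referenced by `instruction`.
        return {"question": instance["question"], "target": instance["answer"]}
```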

## Change Log

- **May 24, 2024**: Chat format support, including conversational few-shot and system prompts.
- **May 10, 2024**: New instruction formatting using f-string and jinja2.
- **May 7, 2024**: Bumped openai and vllm versions.
- **Apr 16, 2024**: Full support for KV caching.
- **Mar 18, 2024**: First release of LLMBox.
88 changes: 88 additions & 0 deletions utilization/chat_templates.py
@@ -0,0 +1,88 @@
# sources: https://github.com/huggingface/chat-ui/blob/main/PROMPTS.md

DEFAULT_CHAT_TEMPLATE = (
    "{% macro add(role, msg) -%}"
    "{{ seq[role + '_start'] }}"
    "{{ msg | smart_space(auto_leading_space, seq[role + '_start']) }}"
    "{{ seq[role + '_end'] }}"
    "{%- endmacro %}"
    "{% for message in messages %}"
    "{{ add(message['role'], message['content']) }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}"
    "{{ seq['assistant_start'] }}"
    "{% endif %}"
)

DEFAULT_CHAT_CONFIGS = {
    "base": {
        "system_start": "",
        "system_end": "\n\n",
        "user_start": "",
        "user_end": "",
        "assistant_start": "",
        "assistant_end": "\n\n",
        "auto_leading_space": True,
        "default_stops": ["\n"],
    },
    "llama2": {
        "system_start": "<s>[INST] <<SYS>>\n",
        "system_end": "\n<</SYS>>\n\n",
        "user_start": "",
        "user_end": " [/INST] ",
        "assistant_start": "",
        "assistant_end": " </s><s>[INST] ",
        "auto_leading_space": True,
        "default_stops": [""],
    },
    "chatml": {
        "system_start": "<|im_start|>system\n",
        "system_end": "<|im_end|>\n",
        "user_start": "<|im_start|>user\n",
        "user_end": "<|im_end|>\n",
        "assistant_start": "<|im_start|>assistant\n",
        "assistant_end": "<|im_end|>\n",
        "auto_leading_space": True,
        "default_stops": ["<|im_end|>"],
    },
    "zephyr": {
        "system_start": "<|system|>\n",
        "system_end": "</s>\n",
        "user_start": "<|user|>\n",
        "user_end": "</s>\n",
        "assistant_start": "<|assistant|>\n",
        "assistant_end": "</s>\n",
        "auto_leading_space": True,
        "default_stops": ["</s>"],
    },
    "phi3": {
        "system_start": "<|system|>\n",
        "system_end": "<|end|>\n",
        "user_start": "<|user|>\n",
        "user_end": "<|end|>\n",
        "assistant_start": "<|assistant|>\n",
        "assistant_end": "<|end|>\n",
        "auto_leading_space": True,
        "default_stops": ["<|end|>"],
    },
    "llama3": {
        "system_start": "<|start_header_id|>system<|end_header_id|>\n\n",
        "system_end": "<|eot_id|>",
        "user_start": "<|start_header_id|>user<|end_header_id|>\n\n",
        "user_end": "<|eot_id|>",
        "assistant_start": "<|start_header_id|>assistant<|end_header_id|>\n\n",
        "assistant_end": "<|eot_id|>",
        "auto_leading_space": True,
        "default_stops": ["<|eot_id|>"],
    },
    "alpaca": {
        "system_start": "### Input:\n",
        "system_end": "\n\n",
        "user_start": "### Instruction:\n",
        "user_end": "\n\n",
        "assistant_start": "### Response:\n",
        "assistant_end": "\n\n",
        "auto_leading_space": True,
        "default_stops": ["###"],
    },
}
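
As a rough illustration of how these pieces fit together, the sketch below renders `DEFAULT_CHAT_TEMPLATE` with jinja2 and the `chatml` config. The `smart_space` filter here is a guessed re-implementation, not LLMBox's actual filter, which is defined elsewhere in the codebase:

```python
from jinja2 import Environment

def smart_space(msg: str, auto_leading_space: bool, prefix: str) -> str:
    """Prepend a space when the preceding sequence doesn't end with whitespace."""
    if auto_leading_space and prefix and not prefix[-1].isspace():
        return " " + msg
    return msg

env = Environment()
env.filters["smart_space"] = smart_space

rendered = env.from_string(DEFAULT_CHAT_TEMPLATE).render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 1 + 1?"},
    ],
    seq=DEFAULT_CHAT_CONFIGS["chatml"],
    auto_leading_space=True,
    add_generation_prompt=True,  # append the assistant start tag
)
print(rendered)
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# What is 1 + 1?<|im_end|>
# <|im_start|>assistant
```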
