Commit

add instruction-pretrain, add adaptllm benchmark code, update uprise bib
cdxeve committed Jun 24, 2024
1 parent 0bdb5d7 commit 471d40f
Showing 33 changed files with 7,157 additions and 221 deletions.
78 changes: 61 additions & 17 deletions adaptllm/README.md
@@ -1,38 +1,82 @@
# Adapting Large Language Models to Domains

This repo contains the model, code and data for our paper [Adapting Large Language Models via Reading Comprehension](https://huggingface.co/papers/2309.09530)

We explore **continued pre-training on domain-specific corpora** for large language models. While this approach enriches LLMs with domain knowledge, it significantly hurts their prompting ability for question answering. Inspired by human learning via reading comprehension, we propose a simple method to **transform large-scale pre-training corpora into reading comprehension texts**, consistently improving prompting performance across tasks in **biomedicine, finance, and law domains**. Our 7B model competes with much larger domain-specific models like BloombergGPT-50B.

### 🤗 [2024/6/21] We release the 2nd version of AdaptLLM at [Instruction-Pretrain](https://huggingface.co/instruction-pretrain) 🤗

**************************** **Updates** ****************************
* 2024/6/22: Released the [benchmarking code](https://github.com/microsoft/LMOps/tree/main/adaptllm).
* 2024/6/21: Released the 2nd version of AdaptLLM at [Instruction-Pretrain](https://huggingface.co/instruction-pretrain).
* 2024/1/16: Our [research paper](https://huggingface.co/papers/2309.09530) has been accepted by ICLR 2024.
* 2023/12/19: Released our [13B base models](https://huggingface.co/AdaptLLM/law-LLM-13B) developed from LLaMA-1-13B.
* 2023/12/8: Released our [chat models](https://huggingface.co/AdaptLLM/law-chat) developed from LLaMA-2-Chat-7B.
* 2023/9/18: Released our [paper](https://huggingface.co/papers/2309.09530), [code](https://github.com/microsoft/LMOps), [data](https://huggingface.co/datasets/AdaptLLM/law-tasks), and [base models](https://huggingface.co/AdaptLLM/law-LLM) developed from LLaMA-1-7B.


# Domain-specific LLMs
Our models for different domains are now available on Hugging Face: [biomedicine-LLM](https://huggingface.co/AdaptLLM/medicine-LLM), [finance-LLM](https://huggingface.co/AdaptLLM/finance-LLM), and [law-LLM](https://huggingface.co/AdaptLLM/law-LLM). The performance of our AdaptLLM models compared with other domain-specific LLMs is shown below:

<p align='center'>
<img src="./comparison.png" width="700">
<img src="./images/comparison.png" width="700">
</p>

We also scale the models up to 13B and train variants from chat models (a minimal loading sketch follows this list):
* Scale up to 13B: [Biomedicine-LLM-13B](https://huggingface.co/AdaptLLM/medicine-LLM-13B), [Finance-LLM-13B](https://huggingface.co/AdaptLLM/finance-LLM-13B) and [Law-LLM-13B](https://huggingface.co/AdaptLLM/law-LLM-13B)
* Chat models: [Biomedicine-Chat](https://huggingface.co/AdaptLLM/medicine-chat), [Finance-Chat](https://huggingface.co/AdaptLLM/finance-chat) and [Law-Chat](https://huggingface.co/AdaptLLM/law-chat)
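
If you just want to try one of the released checkpoints before setting up the full pipeline, the sketch below shows a minimal way to load and prompt a model with `transformers`. This is an illustrative snippet rather than the repo's evaluation code; it assumes `transformers`, `torch`, and `accelerate` are installed and that the chosen checkpoint fits on your hardware.

```python
# Minimal sketch: load a released checkpoint and complete one prompt.
# Illustrative only; assumes transformers, torch, and accelerate are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AdaptLLM/law-LLM"  # any of the model names listed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Question: What does a force majeure clause do in a contract?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```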

# Domain-specific Tasks
To easily reproduce our results, we have uploaded the filled-in zero/few-shot input instructions and output completions for each domain-specific task: [biomedicine-tasks](https://huggingface.co/datasets/AdaptLLM/medicine-tasks), [finance-tasks](https://huggingface.co/datasets/AdaptLLM/finance-tasks), and [law-tasks](https://huggingface.co/datasets/AdaptLLM/law-tasks).
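
To inspect these task files locally, one option is to pull a whole dataset repo with `huggingface_hub`. This is only a sketch; the exact file layout inside each repo may differ from what your evaluation code expects.

```python
# Sketch: download the files of one domain-specific task set for local inspection.
# Assumes huggingface_hub is installed; the file layout inside the repo may vary.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="AdaptLLM/law-tasks",  # or medicine-tasks / finance-tasks
    repo_type="dataset",
)
print("Task files downloaded to:", local_dir)
```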

# Data Processing and Benchmarking Code
## Install Dependencies
```bash
pip install -r requirements.txt
```

## Data: Transfer Raw Corpora into Reading Comprehension
Our method is very **simple**, highly **scalable**, and **applicable** to any pre-training corpus.

Try transferring the raw texts in the [data_samples](./data_samples/README.md) folder:
```bash
python raw2read.py
```
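
For intuition, the script turns a plain passage into that same passage followed by comprehension-style tasks. The toy function below is *not* the mining logic in `raw2read.py`; its question templates are made up purely to illustrate the input/output shape.

```python
# Toy illustration of the input/output shape of the raw-text-to-reading-comprehension step.
# NOT the pattern-mining logic used by raw2read.py; the templates here are hypothetical.
def to_reading_comprehension(raw_text: str) -> str:
    questions = [
        "Please summarize the text above in one sentence.",
        "List the domain-specific terms that appear in the text.",
    ]
    tasks = "\n\n".join(f"Question: {q}\nAnswer:" for q in questions)
    return f"{raw_text}\n\n{tasks}"

print(to_reading_comprehension("A force majeure clause excuses a party from performance ..."))
```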

## Benchmark: Evaluate Our Models on Domain-specific Tasks
To evaluate our models on the domain-specific tasks:
```bash
# domain name, chosen from ['biomedicine', 'finance', 'law']
DOMAIN='biomedicine'

# hf model names chosen from the following (NOT applicable to chat models):
# ['AdaptLLM/medicine-LLM', 'AdaptLLM/finance-LLM', 'AdaptLLM/law-LLM',
# 'AdaptLLM/medicine-LLM-13B', 'AdaptLLM/finance-LLM-13B', 'AdaptLLM/law-LLM-13B',
# 'instruction-pretrain/medicine-Llama3-8B', 'instruction-pretrain/finance-Llama3-8B']
MODEL='instruction-pretrain/medicine-Llama3-8B'

# set MODEL_PARALLEL=False if the model fits on a single GPU
# set MODEL_PARALLEL=True if it is too large for a single GPU
MODEL_PARALLEL=False

# number of GPUs, chosen from [1,2,4,8]
N_GPU=8

# whether to add a BOS token (add_bos_token): set to False for AdaptLLM models and True for instruction-pretrain models
add_bos_token=True

bash scripts/inference.sh ${DOMAIN} ${MODEL} ${add_bos_token} ${MODEL_PARALLEL} ${N_GPU}
```
We include detailed instructions [here](./scripts/README.md).

## Citation
```bibtex
@inproceedings{cheng2024adapting,
  title={Adapting Large Language Models via Reading Comprehension},
  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=y886UXPEZ0}
}
```
26 changes: 26 additions & 0 deletions adaptllm/configs/inference.yaml
@@ -0,0 +1,26 @@
model_name: ??? # hf model name
task_name: ??? # a string concatenating all the evaluation tasks using `+`
output_dir: /tmp/output # for saving the prediction files
res_dir: /tmp/res # for saving the evaluation scores of each task
max_length: 2048 # max length of tokenizer
generate_max_len: 100 # for text completion task
n_tokens: 2048 # max length of (prompt + task input + task output)
cache_dir: /tmp/cache # for caching hf models and datasets
add_bos_token: ???
model_parallel: ??? # True or False

dataset_reader:
  _target_: src.dataset_readers.inference_dsr.InferenceDatasetReader
  model_name: ${model_name}
  task_name: ${task_name}
  n_tokens: ${n_tokens}
  cache_dir: ${cache_dir}
  max_length: ${max_length}
  generate_max_len: ${generate_max_len}
  add_bos_token: ${add_bos_token}
model:
  _target_: src.models.model.get_model
  pretrained_model_name_or_path: ${model_name}
  cache_dir: ${cache_dir}
  trust_remote_code: true
  model_parallel: ${model_parallel}
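
For readers unfamiliar with this config style: the `???` entries are mandatory values that must be supplied as command-line overrides, and each `_target_` names a class or function for Hydra to instantiate. The sketch below shows how such a config is typically consumed; the repo's actual entrypoint (and its file name) may differ.

```python
# Sketch of how a Hydra config with `_target_` entries is typically consumed.
# The entrypoint name and wiring here are illustrative; the repo's real code may differ.
import hydra
from hydra.utils import instantiate
from omegaconf import DictConfig


@hydra.main(config_path="configs", config_name="inference", version_base=None)
def main(cfg: DictConfig) -> None:
    # Required `???` fields (model_name, task_name, add_bos_token, model_parallel)
    # are passed as overrides, e.g.:
    #   python <entrypoint>.py model_name=AdaptLLM/law-LLM task_name=<task> \
    #       add_bos_token=False model_parallel=False
    model = instantiate(cfg.model)                    # calls src.models.model.get_model(...)
    dataset_reader = instantiate(cfg.dataset_reader)  # builds the InferenceDatasetReader
    # ... run inference with `model` over `dataset_reader`,
    #     writing predictions to cfg.output_dir and scores to cfg.res_dir


if __name__ == "__main__":
    main()
```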
19 changes: 0 additions & 19 deletions adaptllm/data_samples/output-read-compre/0.txt

This file was deleted.

27 changes: 0 additions & 27 deletions adaptllm/data_samples/output-read-compre/1.txt

This file was deleted.

6 changes: 0 additions & 6 deletions adaptllm/data_samples/output-read-compre/10.txt

This file was deleted.
