modify readme, fix preprocess_dataset.py, add setup steps in notebook
Signed-off-by: kta-intel <[email protected]>
kta-intel committed Jan 19, 2024
1 parent 575e243 commit 11900e6
Showing 3 changed files with 113 additions and 47 deletions.
36 changes: 2 additions & 34 deletions openfl-tutorials/experimental/LLM/neuralchat/README.md
@@ -10,41 +10,9 @@ Intel's [Neural-Chat-v3](https://huggingface.co/Intel/neural-chat-7b-v3) is a fi

Additional details on the fine-tuning can be found [here](https://medium.com/intel-analytics-software/the-practice-of-supervised-finetuning-and-direct-preference-optimization-on-habana-gaudi2-a1197d8a3cd3).

## 3. Installing dependencies
## 3. Running the tutorial

In this tutorial, we will be fine-tuning Intel's neuralchat-7b model using OpenFL and Intel(R) Extension for Transformers

Start by installing Intel(R) Extension for Transformers (for stability, we will use v1.2.2) and OpenFL

```sh
pip install intel-extension-for-transformers==1.2.2
pip install openfl
```

From here, we can install the requirements needed to run OpenFL's workflow interface and Intel(R) Extension for Transformers' Neural Chat framework

```sh
pip install -r requirements_neural_chat.txt
pip install -r requirements_workflow_interface.txt
```

## 4. Acquiring and preprocessing dataset

We can clone the dataset directly from the MedQuAD repository

```sh
git clone https://github.com/abachaa/MedQuAD.git
```

From here, we provide preprocessing code to prepare the dataset to be readily ingestible by the fine-tuning pipeline

```sh
python preprocess_dataset.py
```

## 5: Running the tutorial

You are now ready to follow along in the tutorial notebook: `Workflow_Interface_NeuralChat.ipynb`
Follow along step-by-step in the [notebook](Workflow_Interface_NeuralChat.ipynb) to learn how to fine-tune neural-chat-7b on the MedQuAD dataset.

## Reference:
```
104 changes: 99 additions & 5 deletions openfl-tutorials/experimental/LLM/neuralchat/Workflow_Interface_NeuralChat.ipynb
@@ -15,9 +15,9 @@
"id": "bd059520",
"metadata": {},
"source": [
"In this tutorial, we build on the ideas from the [first](https://github.com/intel/openfl/blob/develop/openfl-tutorials/experimental/Workflow_Interface_101_MNIST.ipynb) quick start notebook, and demonstrate how to fine-tune an LLM in a federated learning workflow. \r\n",
"In this tutorial, we build on the ideas from the [first](https://github.com/intel/openfl/blob/develop/openfl-tutorials/experimental/Workflow_Interface_101_MNIST.ipynb) quick start notebook, and demonstrate how to fine-tune an LLM in a federated learning workflow. \n",
"\n",
"We will fine-tune **Intel's [neural-chat-7b]https://huggingface.co/Intel/neural-chat-7b-v31)** model on the [MedQuAD](https://github.com/abachaa/MedQuAD) dataset, an open-source medical question-answer pair dataset collated from 12 NIH websites. To do this, we will leverage the **[Intel(R) Extension for Transformers](https://github.com/intel/intel-extension-for-transformers)**, which extends th [Hugging Face Transformers](https://github.com/huggingface/transformers) library with added features for optimal performance on Intel hardware."
"We will fine-tune **Intel's [neural-chat-7b](https://huggingface.co/Intel/neural-chat-7b-v1)** model on the [MedQuAD](https://github.com/abachaa/MedQuAD) dataset, an open-source medical question-answer pair dataset collated from 12 NIH websites. To do this, we will leverage the **[Intel(R) Extension for Transformers](https://github.com/intel/intel-extension-for-transformers)**, which extends th [Hugging Face Transformers](https://github.com/huggingface/transformers) library with added features for optimal performance on Intel hardware.."
]
},
{
@@ -36,6 +36,103 @@
"The workflow interface is a new way of composing federated learning expermients with OpenFL. It was borne through conversations with researchers and existing users who had novel use cases that didn't quite fit the standard horizontal federated learning paradigm. "
]
},
{
"cell_type": "markdown",
"id": "df198264-eba9-4baa-b585-6d8530dbc83c",
"metadata": {},
"source": [
"## Initial Setup\n",
"### Installing dependencies\n",
"Start by installing Intel(R) Extension for Transformers (for stability, we will use v1.2.2) and OpenFL"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56f4628e-7a1b-4576-bf6e-637757b2726d",
"metadata": {},
"outputs": [],
"source": [
"!pip install intel-extension-for-transformers==1.2.2\n",
"!pip install openfl"
]
},
{
"cell_type": "markdown",
"id": "124ae236-2e33-4349-9979-f506d796276d",
"metadata": {},
"source": [
"From here, we can install requirements needed to run OpenFL's workflow interface and Intel(R) Extension for Transformer's Neural Chat framework"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63207a15-e1e3-4b7a-8a85-53618f8ec8ef",
"metadata": {},
"outputs": [],
"source": [
"!pip install -r requirements_neural_chat.txt\n",
"!pip install -r requirements_workflow_interface.txt"
]
},
{
"cell_type": "markdown",
"id": "b8c24994-1b30-4f03-82ba-5a58bb347b70",
"metadata": {},
"source": [
"### Acquiring and preprocessing dataset\n",
"We can clone the dataset directly from the [MedQuAD repository](https://github.com/abachaa/MedQuAD)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a6674c17-1652-4e87-a885-bc10bf3624c6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cloning into 'MedQuAD'...\n",
"remote: Enumerating objects: 11310, done.\u001b[K\n",
"remote: Counting objects: 100% (9/9), done.\u001b[K\n",
"remote: Compressing objects: 100% (9/9), done.\u001b[K\n",
"remote: Total 11310 (delta 3), reused 1 (delta 0), pack-reused 11301\u001b[K\n",
"Receiving objects: 100% (11310/11310), 11.01 MiB | 33.16 MiB/s, done.\n",
"Resolving deltas: 100% (6806/6806), done.\n"
]
}
],
"source": [
"!git clone https://github.com/abachaa/MedQuAD.git"
]
},
{
"cell_type": "markdown",
"id": "98014201-01b6-4726-b483-6d7101a3aa51",
"metadata": {},
"source": [
"From here, we run the provided preprocessing code to prepare the dataset so that it is readily ingestible by the fine-tuning pipeline"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ca3dd737-1882-4a1c-b95b-827a8110c307",
"metadata": {},
"outputs": [],
"source": [
"from preprocess_dataset import xml_to_json\n",
"\n",
"# User input for folder paths\n",
"input_base_folder = \"./MedQuAD/\"\n",
"output_folder = \"./\"\n",
"\n",
"xml_to_json(input_base_folder, output_folder)"
]
},
{
"cell_type": "markdown",
"id": "fc8e35da",
@@ -60,9 +157,6 @@
" DataCollatorForSeq2Seq,\n",
")\n",
"\n",
"from transformers.modeling_utils import unwrap_model\n",
"from transformers.trainer_utils import is_main_process\n",
"\n",
"from intel_extension_for_transformers.neural_chat.config import (\n",
" ModelArguments,\n",
" DataArguments,\n",
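Two pieces of the notebook that this diff elides are worth sketching. First, the "workflow interface" mentioned above: a minimal `FLSpec` flow in the style of the linked 101 MNIST tutorial. The class, step, and attribute names here are illustrative assumptions, not the notebook's actual flow:

```python
# Minimal sketch of an OpenFL workflow-interface flow (illustrative only).
from openfl.experimental.interface import FLSpec
from openfl.experimental.placement import aggregator, collaborator


class FinetuneFlow(FLSpec):

    @aggregator
    def start(self):
        # Fan the current model out to every collaborator.
        self.collaborators = self.runtime.collaborators
        self.next(self.train, foreach='collaborators')

    @collaborator
    def train(self):
        # Each collaborator fine-tunes on its local MedQuAD shard here.
        self.next(self.join)

    @aggregator
    def join(self, inputs):
        # Aggregate the updates returned by the collaborators.
        self.next(self.end)

    @aggregator
    def end(self):
        print("Flow complete")
```

Second, the imports in the hunk above (`TrainingArguments`, `ModelArguments`, `DataArguments`) feed Neural Chat's fine-tuning entry point. A rough sketch of that pattern, assuming the ITREX v1.2.x API; the paths, filenames, and hyperparameters are placeholders, not this commit's settings:

```python
# Hedged sketch of the neural_chat fine-tuning flow; values are placeholders.
from transformers import TrainingArguments
from intel_extension_for_transformers.neural_chat.config import (
    ModelArguments,
    DataArguments,
    FinetuningArguments,
    TextGenerationFinetuningConfig,
)
from intel_extension_for_transformers.neural_chat.chatbot import finetune_model

model_args = ModelArguments(model_name_or_path="Intel/neural-chat-7b-v3")
data_args = DataArguments(train_file="medquad_train.json")  # hypothetical filename
training_args = TrainingArguments(
    output_dir="./finetuned_model",
    do_train=True,
    per_device_train_batch_size=1,
    num_train_epochs=1,
)
finetune_args = FinetuningArguments()

finetune_model(TextGenerationFinetuningConfig(
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    finetune_args=finetune_args,
))
```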
20 changes: 12 additions & 8 deletions openfl-tutorials/experimental/LLM/neuralchat/preprocess_dataset.py
@@ -1,9 +1,17 @@
import xml.etree.ElementTree as ET
# Copyright (C) 2020-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import xml.etree.ElementTree as et
import json
import os
import math


def xml_to_json(input_base_folder, output_folder):

if not os.path.exists(input_base_folder):
raise SystemExit(f"The folder '{input_base_folder}' does not exist.")

train_data = []
test_data = []
train_count, test_count = 0, 0
@@ -38,9 +46,10 @@ def xml_to_json(input_base_folder, output_folder):
f.write(f"Training data pairs: {train_count}\n")
f.write(f"Test data pairs: {test_count}\n")


def process_xml_file(folder, xml_file):
xml_path = os.path.join(folder, xml_file)
tree = ET.parse(xml_path)
tree = et.parse(xml_path)
root = tree.getroot()

data = []
@@ -65,12 +74,7 @@ def process_xml_file(folder, xml_file):

return data, count


def save_json(data, filename):
with open(filename, 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=4)

# User input for folder paths
input_base_folder = "./MedQuAD/"
output_folder = "./"

xml_to_json(input_base_folder, output_folder)
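
With the module-level driver removed, `preprocess_dataset.py` no longer runs preprocessing as a side effect of being imported, which is what lets the notebook call `xml_to_json` directly. If script-style invocation were still wanted, a hypothetical `__main__` guard (not part of this commit) would restore it without the import-time side effect:

```python
# Hypothetical guard (not in this commit): keep `python preprocess_dataset.py`
# working without running preprocessing whenever the module is imported.
if __name__ == "__main__":
    input_base_folder = "./MedQuAD/"
    output_folder = "./"
    xml_to_json(input_base_folder, output_folder)
```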
