modify readme, fix preprocess_dataset.py, add setup steps in notebook
Signed-off-by: kta-intel <[email protected]>
kta-intel committed Jan 19, 2024
1 parent 575e243 commit 11900e6
Showing 3 changed files with 113 additions and 47 deletions.
36 changes: 2 additions & 34 deletions openfl-tutorials/experimental/LLM/neuralchat/README.md
@@ -10,41 +10,9 @@ Intel's [Neural-Chat-v3](https://huggingface.co/Intel/neural-chat-7b-v3) is a fi

Additional details on the fine-tuning can be found [here](https://medium.com/intel-analytics-software/the-practice-of-supervised-finetuning-and-direct-preference-optimization-on-habana-gaudi2-a1197d8a3cd3).

## 3. Installing dependencies
## 3. Running the tutorial

In this tutorial, we will be fine-tuning Intel's neuralchat-7b model using OpenFL and Intel(R) Extension for Transformers

Start by installing Intel(R) Extension for Transformers (for stability, we will use v1.2.2) and OpenFL

```sh
pip install intel-extension-for-transformers==1.2.2
pip install openfl
```

From here, we can install the requirements needed to run OpenFL's workflow interface and Intel(R) Extension for Transformers' Neural Chat framework

```sh
pip install -r requirements_neural_chat.txt
pip install -r requirements_workflow_interface.txt
```

## 4. Acquiring and preprocessing dataset

We can clone the dataset directly from the MedQuAD repository

```sh
git clone https://github.com/abachaa/MedQuAD.git
```

From here, we provide preprocessing code to prepare the dataset to be readily ingestible by the fine-tuning pipeline

```sh
python preprocess_dataset.py
```

## 5: Running the tutorial

You are now ready to follow along in the tutorial notebook: `Workflow_Interface_NeuralChat.ipynb`
Follow along step-by-step in the [notebook](Workflow_Interface_NeuralChat.ipynb) to learn how to fine-tune neural-chat-7b on the MedQuAD dataset.

## Reference:
```
104 changes: 99 additions & 5 deletions openfl-tutorials/experimental/LLM/neuralchat/Workflow_Interface_NeuralChat.ipynb
@@ -15,9 +15,9 @@
"id": "bd059520",
"metadata": {},
"source": [
"In this tutorial, we build on the ideas from the [first](https://github.com/intel/openfl/blob/develop/openfl-tutorials/experimental/Workflow_Interface_101_MNIST.ipynb) quick start notebook, and demonstrate how to fine-tune an LLM in a federated learning workflow. \r\n",
"In this tutorial, we build on the ideas from the [first](https://github.com/intel/openfl/blob/develop/openfl-tutorials/experimental/Workflow_Interface_101_MNIST.ipynb) quick start notebook, and demonstrate how to fine-tune an LLM in a federated learning workflow. \n",
"\n",
"We will fine-tune **Intel's [neural-chat-7b]https://huggingface.co/Intel/neural-chat-7b-v31)** model on the [MedQuAD](https://github.com/abachaa/MedQuAD) dataset, an open-source medical question-answer pair dataset collated from 12 NIH websites. To do this, we will leverage the **[Intel(R) Extension for Transformers](https://github.com/intel/intel-extension-for-transformers)**, which extends th [Hugging Face Transformers](https://github.com/huggingface/transformers) library with added features for optimal performance on Intel hardware."
"We will fine-tune **Intel's [neural-chat-7b](https://huggingface.co/Intel/neural-chat-7b-v1)** model on the [MedQuAD](https://github.com/abachaa/MedQuAD) dataset, an open-source medical question-answer pair dataset collated from 12 NIH websites. To do this, we will leverage the **[Intel(R) Extension for Transformers](https://github.com/intel/intel-extension-for-transformers)**, which extends th [Hugging Face Transformers](https://github.com/huggingface/transformers) library with added features for optimal performance on Intel hardware.."
]
},
{
@@ -36,6 +36,103 @@
"The workflow interface is a new way of composing federated learning expermients with OpenFL. It was borne through conversations with researchers and existing users who had novel use cases that didn't quite fit the standard horizontal federated learning paradigm. "
]
},
{
"cell_type": "markdown",
"id": "df198264-eba9-4baa-b585-6d8530dbc83c",
"metadata": {},
"source": [
"## Initial Setup\n",
"### Installing dependencies\n",
"Start by installing Intel(R) Extension for Transformers (for stability, we will use v1.2.2) and OpenFL"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56f4628e-7a1b-4576-bf6e-637757b2726d",
"metadata": {},
"outputs": [],
"source": [
"!pip install intel-extension-for-transformers==1.2.2\n",
"!pip install openfl"
]
},
{
"cell_type": "markdown",
"id": "124ae236-2e33-4349-9979-f506d796276d",
"metadata": {},
"source": [
"From here, we can install requirements needed to run OpenFL's workflow interface and Intel(R) Extension for Transformer's Neural Chat framework"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63207a15-e1e3-4b7a-8a85-53618f8ec8ef",
"metadata": {},
"outputs": [],
"source": [
"!pip install -r requirements_neural_chat.txt\n",
"!pip install -r requirements_workflow_interface.txt"
]
},
{
"cell_type": "markdown",
"id": "b8c24994-1b30-4f03-82ba-5a58bb347b70",
"metadata": {},
"source": [
"### Acquiring and preprocessing dataset\n",
"We can clone the dataset directly from the [MedQuAD repository](https://github.com/abachaa/MedQuAD)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a6674c17-1652-4e87-a885-bc10bf3624c6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cloning into 'MedQuAD'...\n",
"remote: Enumerating objects: 11310, done.\u001b[K\n",
"remote: Counting objects: 100% (9/9), done.\u001b[K\n",
"remote: Compressing objects: 100% (9/9), done.\u001b[K\n",
"remote: Total 11310 (delta 3), reused 1 (delta 0), pack-reused 11301\u001b[K\n",
"Receiving objects: 100% (11310/11310), 11.01 MiB | 33.16 MiB/s, done.\n",
"Resolving deltas: 100% (6806/6806), done.\n"
]
}
],
"source": [
"!git clone https://github.com/abachaa/MedQuAD.git"
]
},
{
"cell_type": "markdown",
"id": "98014201-01b6-4726-b483-6d7101a3aa51",
"metadata": {},
"source": [
"From here, we run the provided preprocessing code to prepare the dataset so that it is readily ingestible by the fine-tuning pipeline"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ca3dd737-1882-4a1c-b95b-827a8110c307",
"metadata": {},
"outputs": [],
"source": [
"from preprocess_dataset import xml_to_json\n",
"\n",
"# User input for folder paths\n",
"input_base_folder = \"./MedQuAD/\"\n",
"output_folder = \"./\"\n",
"\n",
"xml_to_json(input_base_folder, output_folder)"
]
},
{
"cell_type": "markdown",
"id": "fc8e35da",
@@ -60,9 +157,6 @@
" DataCollatorForSeq2Seq,\n",
")\n",
"\n",
"from transformers.modeling_utils import unwrap_model\n",
"from transformers.trainer_utils import is_main_process\n",
"\n",
"from intel_extension_for_transformers.neural_chat.config import (\n",
" ModelArguments,\n",
" DataArguments,\n",
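Two pieces of the notebook that this diff elides are worth sketching. First, the "workflow interface" mentioned above: a minimal `FLSpec` flow in the style of the linked 101 MNIST tutorial. The class, step, and attribute names here are illustrative assumptions, not the notebook's actual flow:

```python
# Minimal sketch of an OpenFL workflow-interface flow (illustrative only).
from openfl.experimental.interface import FLSpec
from openfl.experimental.placement import aggregator, collaborator


class FinetuneFlow(FLSpec):

    @aggregator
    def start(self):
        # Fan the current model out to every collaborator.
        self.collaborators = self.runtime.collaborators
        self.next(self.train, foreach='collaborators')

    @collaborator
    def train(self):
        # Each collaborator fine-tunes on its local MedQuAD shard here.
        self.next(self.join)

    @aggregator
    def join(self, inputs):
        # Aggregate the updates returned by the collaborators.
        self.next(self.end)

    @aggregator
    def end(self):
        print("Flow complete")
```

Second, the imports in the hunk above (`TrainingArguments`, `ModelArguments`, `DataArguments`) feed Neural Chat's fine-tuning entry point. A rough sketch of that pattern, assuming the ITREX v1.2.x API; the paths, filenames, and hyperparameters are placeholders, not this commit's settings:

```python
# Hedged sketch of the neural_chat fine-tuning flow; values are placeholders.
from transformers import TrainingArguments
from intel_extension_for_transformers.neural_chat.config import (
    ModelArguments,
    DataArguments,
    FinetuningArguments,
    TextGenerationFinetuningConfig,
)
from intel_extension_for_transformers.neural_chat.chatbot import finetune_model

model_args = ModelArguments(model_name_or_path="Intel/neural-chat-7b-v3")
data_args = DataArguments(train_file="medquad_train.json")  # hypothetical filename
training_args = TrainingArguments(
    output_dir="./finetuned_model",
    do_train=True,
    per_device_train_batch_size=1,
    num_train_epochs=1,
)
finetune_args = FinetuningArguments()

finetune_model(TextGenerationFinetuningConfig(
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    finetune_args=finetune_args,
))
```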
20 changes: 12 additions & 8 deletions openfl-tutorials/experimental/LLM/neuralchat/preprocess_dataset.py
@@ -1,9 +1,17 @@
import xml.etree.ElementTree as ET
# Copyright (C) 2020-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import xml.etree.ElementTree as et
import json
import os
import math


def xml_to_json(input_base_folder, output_folder):

if not os.path.exists(input_base_folder):
raise SystemExit(f"The folder '{input_base_folder}' does not exist.")

train_data = []
test_data = []
train_count, test_count = 0, 0
@@ -38,9 +46,10 @@ def xml_to_json(input_base_folder, output_folder):
f.write(f"Training data pairs: {train_count}\n")
f.write(f"Test data pairs: {test_count}\n")


def process_xml_file(folder, xml_file):
xml_path = os.path.join(folder, xml_file)
tree = ET.parse(xml_path)
tree = et.parse(xml_path)
root = tree.getroot()

data = []
@@ -65,12 +74,7 @@ def process_xml_file(folder, xml_file):

return data, count


def save_json(data, filename):
with open(filename, 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=4)

# User input for folder paths
input_base_folder = "./MedQuAD/"
output_folder = "./"

xml_to_json(input_base_folder, output_folder)
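
With the module-level driver removed, `preprocess_dataset.py` no longer runs preprocessing as a side effect of being imported, which is what lets the notebook call `xml_to_json` directly. If script-style invocation were still wanted, a hypothetical `__main__` guard (not part of this commit) would restore it without the import-time side effect:

```python
# Hypothetical guard (not in this commit): keep `python preprocess_dataset.py`
# working without running preprocessing whenever the module is imported.
if __name__ == "__main__":
    input_base_folder = "./MedQuAD/"
    output_folder = "./"
    xml_to_json(input_base_folder, output_folder)
```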
