Data Generation Codebase for OpenAI GPT

This repository contains Python scripts and utilities for generating and processing datasets for training machine learning models, using OpenAI's GPT models to produce the data.

Overview

The codebase covers the steps of dataset preparation, from formatting prompts to sending API calls and processing responses. The tools are aimed at researchers and developers who want to automate the preparation of training data for supervised fine-tuning (SFT) and RLHF of LLMs.

Directory Structure

/jsons - Directory where all script outputs in JSON format are stored.
/example_runs - Contains example bash scripts demonstrating different use cases.

Scripts

Data Formatting

prompt_format.py: Formats user input into a GPT-compatible format.
alpaca_formatting.py: Formats data into the Alpaca structure used for training.
split_response.py: Splits prompts and responses for specific use cases depending on the stage of data generation.
trim_response.py: Trims and cleans the messages.
preference_format.py: Formats data for RLHF tuning by marking each response as chosen or rejected.
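
As an illustration of the two target formats, here is a minimal Python sketch. It is not taken from alpaca_formatting.py or preference_format.py, and both function names are hypothetical. The instruction/input/output keys follow the published Alpaca dataset layout; the prompt/chosen/rejected keys are an assumption based on a common convention for preference data, and the scripts in this repository may use different field names.

```python
import json


def to_alpaca_record(instruction, output, input_text=""):
    """Build one record in the Alpaca layout (instruction / input / output)."""
    return {"instruction": instruction, "input": input_text, "output": output}


def to_preference_record(prompt, chosen, rejected):
    """Build one preference record; the key names are an assumed convention."""
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


if __name__ == "__main__":
    sft = to_alpaca_record("Name a prime number.", "7")
    pref = to_preference_record("Name a prime number.", "7", "8")
    print(json.dumps([sft, pref], indent=2))
```

The Alpaca record feeds SFT, while the preference record pairs a preferred and a dispreferred answer to the same prompt for RLHF-style tuning.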

API Interaction

response_gpt.py: Handles API calls to generate prompts/responses using multithreading for efficiency.
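
The general pattern of issuing many requests from a thread pool can be sketched as follows. This is an assumption about the approach, not the code in response_gpt.py: generate_all and fake_model are hypothetical names, and fake_model stands in for a real API request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_all(prompts, call_model, max_workers=8):
    """Send every prompt through call_model concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which request finishes first.
        return list(pool.map(call_model, prompts))


def fake_model(prompt):
    """Stand-in for a real API request (hypothetical, for illustration only)."""
    return f"response to: {prompt}"


if __name__ == "__main__":
    for reply in generate_all(["a", "b", "c"], fake_model, max_workers=2):
        print(reply)
```

Threads suit this workload because each request spends most of its time waiting on the network rather than using the CPU.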

Example Bash Scripts

Located in /example_runs, these scripts demonstrate practical applications of the tools:

A script for generating 'rejected' data for the RLHF dataset.
A script for generating multiple responses for a single prompt.
A script for generating prompts first, then responses for a QA dataset.

Setup and Execution

Installation

Clone the repository:

```
git clone https://github.com/lightmatmul/Data-Generation-with-OpenAI.git
cd Data-Generation-with-OpenAI
```

Create and activate a virtual environment:

```
python -m venv env
source env/bin/activate  # On Windows, use env\Scripts\activate
```

Install the required packages:
```
pip install -r requirements.txt
```

API key setup

Ensure you have an OpenAI API key, which you will need to insert into response_gpt.py.
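
If you would rather not hard-code the key, one common alternative is to read it from an environment variable. The sketch below is an assumption about how this could be done, not a description of how response_gpt.py currently works; load_api_key is a hypothetical helper, and OPENAI_API_KEY is the variable name conventionally used with OpenAI's client library.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

Keeping the key out of the source file makes it less likely to be committed by accident.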

Permissions

Before running the scripts, set the necessary permissions using the following command for each bash script:

```
chmod +x script.sh
# Then run:
./script.sh
```
