Efficient and High-Quality Domain-Specific Synthetic Data Generation Pipeline

An efficient pipeline for generating high-quality, domain-specific synthetic data. It produces diverse questions and long-form responses grounded in your own domain data.

Features:

Generate diverse and high-quality domain-specific questions and responses.
Customizable pipeline to fit specific requirements for synthetic data generation.

Requirements

Python 3.10
OpenAI API key (or OpenAISource: an open-source model served through a library with OpenAI-style serving support, e.g. vLLM or SGLang)

Installation and Setup

Set up environment variables:
- Create a .env file in the root directory of the project and add your OpenAI API Key as follows:
```
OPENAI_API_KEY="your-openai-api-key"
```
- If you're using OpenAISource, refer to the vLLM documentation to set up OpenAI serving, then modify src/openai_calling/call_openai.py to match your configuration.
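As a rough illustration of the environment setup above, the sketch below reads a .env file using only the standard library and collects the connection settings. It is not the content of src/openai_calling/call_openai.py: the function names are hypothetical, and the OPENAI_BASE_URL variable is an assumption for pointing at a self-hosted server.

```python
import os


def load_env_file(path=".env"):
    """Read KEY="value" lines from a .env file into os.environ.

    Variables that are already set in the environment are left unchanged.
    """
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            value = value.strip().strip('"').strip("'")
            os.environ.setdefault(key.strip(), value)


def client_settings():
    """Collect settings for an OpenAI or OpenAI-style endpoint (hypothetical helper)."""
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Assumption: OPENAI_BASE_URL is only set when using a self-hosted server.
    return {"api_key": api_key, "base_url": os.environ.get("OPENAI_BASE_URL")}
```

When base_url is None, the default OpenAI endpoint would be used; for a local server it would hold that server's address.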
Install dependencies:
- Run the following command to install the required dependencies:
```
pip install -r requirements.txt
```
Configure the pipeline:
- Edit the settings in config.yml according to your needs. The configuration file provides options to customize the pipeline's behavior.
Data Settings
- chunks: Path to the folder containing data chunks. The data should be in the format: {"id": id, "text": text}.
  - Example: "data/chunks"
- instructions: Path to the folder storing the temporary instruction data.
  - Example: "data/instructions"
- responses: Path to the folder containing the temporary response data that corresponds to the instructions.
  - Example: "data/responses"
- sft_data: Path to the folder where the final synthetic data will be saved.
  - Example: "data/sft_data"
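To make the chunk format above concrete, here is a minimal sketch that writes and reads records of the form {"id": id, "text": text}. The file layout is an assumption: the README only specifies the record format, so the one-record-per-line layout, the chunks.jsonl file name, and both helper functions are hypothetical.

```python
import json
from pathlib import Path


def write_chunks(chunks_dir, texts):
    """Write one {"id", "text"} record per line into the chunks folder."""
    out_dir = Path(chunks_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "chunks.jsonl"  # hypothetical file name
    with out_path.open("w", encoding="utf-8") as fh:
        for i, text in enumerate(texts):
            record = {"id": i, "text": text}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def read_chunks(path):
    """Load records and check that each one has the documented keys."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            record = json.loads(line)
            if not {"id", "text"} <= set(record):
                raise ValueError(f"record is missing 'id' or 'text': {record}")
            records.append(record)
    return records
```

Check how the pipeline actually loads the chunks folder before relying on this layout.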
Domain and Language
- domain: Define the domain of your data. This can be any domain (e.g., economic, healthcare, technology).
  - Example: "economic"
- output_language: The language for the generated synthetic data. For example, Vietnamese or English.
  - Example: "Vietnamese"
- instruction_requirements: Additional specifications for the content of instructions. Leave empty for default settings.
- response_requirements: Additional specifications for the content of responses. Leave empty for default settings.
Question and Chunk Settings
- number_questions_per_openai_call: Defines the number of questions to generate per OpenAI API call for each question type.
  - Example: 1
- number_questions_per_chunk: Defines how many questions will be generated for each chunk of data. The total number of questions per chunk is less than or equal to number_questions_per_openai_call * number_questions_per_chunk. A small number of questions per chunk (fewer than 10) is recommended for more variety in tasks.
  - Example: 2
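The upper bound described above is simple arithmetic; the sketch below spells it out so you can estimate how many questions a run may produce. The function names are hypothetical, and the actual count can be lower than this bound.

```python
def max_questions_per_chunk(number_questions_per_openai_call, number_questions_per_chunk):
    """Upper bound on questions for one chunk, as stated in the settings above."""
    return number_questions_per_openai_call * number_questions_per_chunk


def max_total_questions(num_chunks, number_questions_per_openai_call, number_questions_per_chunk):
    """Upper bound on questions across all chunks, before any filtering."""
    per_chunk = max_questions_per_chunk(
        number_questions_per_openai_call, number_questions_per_chunk
    )
    return num_chunks * per_chunk
```

With the example values (1 and 2), each chunk yields at most 2 questions, so 100 chunks yield at most 200.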
Filter Settings

These settings allow you to apply thresholds for filtering instructions and responses based on specific criteria.
- instruction_length: Threshold for the minimum number of characters in an instruction.
  - Example: 10
- instruction_quality: Quality level of the instruction, with values:
  - 'very poor': 0
  - 'poor': 1
  - 'average': 2
  - 'good': 3
  - 'excellent': 4
  - Recommended: 3 (Good).
- instruction_difficulty: Difficulty level of the instruction, with values:
  - 'very easy': 0
  - 'easy': 1
  - 'moderate': 2
  - 'difficult': 3
  - 'very difficult': 4
  - Recommended: 2 (Moderate).
- response_length: Minimum number of characters required for a valid response.
  - Example: 10
- response_length_over_document: Threshold for the ratio of response length to source document length.
  - Example: 0
- response_quality: Quality level of the response, with values:
  - 'very poor': 0
  - 'poor': 1
  - 'average': 2
  - 'good': 3
  - 'excellent': 4
  - Recommended: 3 (Good).
- filter_instructions_first: Determines if instructions should be filtered before generating the responses.
  - Example: True
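The thresholds above could be applied roughly as in the following sketch. It rests on several assumptions that the README does not confirm: every threshold is treated as an inclusive lower bound, lengths are counted in characters, and the record field names (instruction, response, quality, difficulty) are hypothetical.

```python
# Label-to-score mappings as listed in the filter settings.
QUALITY_SCORES = {"very poor": 0, "poor": 1, "average": 2, "good": 3, "excellent": 4}
DIFFICULTY_SCORES = {
    "very easy": 0, "easy": 1, "moderate": 2, "difficult": 3, "very difficult": 4,
}


def keep_instruction(item, cfg):
    """Return True if an instruction record meets all instruction thresholds."""
    return (
        len(item["instruction"]) >= cfg["instruction_length"]
        and QUALITY_SCORES[item["quality"]] >= cfg["instruction_quality"]
        and DIFFICULTY_SCORES[item["difficulty"]] >= cfg["instruction_difficulty"]
    )


def keep_response(item, document, cfg):
    """Return True if a response record meets all response thresholds."""
    # Assumption: the ratio is response characters over document characters.
    ratio = len(item["response"]) / max(len(document), 1)
    return (
        len(item["response"]) >= cfg["response_length"]
        and ratio >= cfg["response_length_over_document"]
        and QUALITY_SCORES[item["quality"]] >= cfg["response_quality"]
    )
```

Under these assumptions, a response_length_over_document of 0 effectively disables the ratio check.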
Model and Batch Settings
- llm: Specifies the model name to use for data generation. You can use a variant of GPT models, such as gpt-4o-mini or gpt-4o.
  - Example: "gpt-4o-mini"
- batch_size: Number of samples synthesized at the same time in each batch.
  - Example: 5
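One way to picture batch_size is the sketch below, which splits the work into batches and runs each batch concurrently with a thread pool. This is an assumption about the mechanism, not the pipeline's actual implementation; iter_batches, synthesize_all, and the generate callable are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def synthesize_all(items, generate, batch_size):
    """Apply generate to every item, one batch at a time, keeping input order."""
    results = []
    for batch in iter_batches(items, batch_size):
        # All items in a batch are submitted at once; map preserves order.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            results.extend(pool.map(generate, batch))
    return results
```

In practice generate would wrap one model call, so a larger batch_size means more simultaneous requests against your rate limit.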
By modifying these settings in the config.yml file, you can tailor the data generation pipeline to your specific domain and requirements.
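For reference, the example and recommended values from the settings above are collected below as a Python dictionary, together with a small sanity check. The flat layout is an assumption: the real config.yml may group or nest these keys differently, and validate_config is a hypothetical helper.

```python
# Example values taken from the settings described above (layout is assumed).
EXAMPLE_CONFIG = {
    "chunks": "data/chunks",
    "instructions": "data/instructions",
    "responses": "data/responses",
    "sft_data": "data/sft_data",
    "domain": "economic",
    "output_language": "Vietnamese",
    "instruction_requirements": "",
    "response_requirements": "",
    "number_questions_per_openai_call": 1,
    "number_questions_per_chunk": 2,
    "instruction_length": 10,
    "instruction_quality": 3,
    "instruction_difficulty": 2,
    "response_length": 10,
    "response_length_over_document": 0,
    "response_quality": 3,
    "filter_instructions_first": True,
    "llm": "gpt-4o-mini",
    "batch_size": 5,
}


def validate_config(cfg):
    """Return a list of problems found in the numeric settings (empty if none)."""
    errors = []
    # Quality and difficulty levels are documented as integers from 0 to 4.
    for key in ("instruction_quality", "instruction_difficulty", "response_quality"):
        if cfg[key] not in range(5):
            errors.append(f"{key} must be between 0 and 4")
    # Counts must be positive for the pipeline to produce anything.
    for key in ("number_questions_per_openai_call", "number_questions_per_chunk", "batch_size"):
        if cfg[key] < 1:
            errors.append(f"{key} must be at least 1")
    return errors
```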
Run the pipeline:
- Execute the pipeline with the following command:
```
python openai-magie-pipeline.py
```
Override default configuration for experiments:
- If you'd like to run multiple experiments with a custom configuration, pass a different config file via the --override_default_config flag:
```
python openai-magie-pipeline.py --override_default_config <path_to_your_custom_config.yml>
```
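A command-line flag like this is typically handled with argparse. The sketch below shows one plausible way to resolve which config file to load; the default path and the function names are assumptions, and the actual script may merge the override into the default settings rather than replace them.

```python
import argparse

DEFAULT_CONFIG_PATH = "config.yml"  # assumption: the real default path may differ


def build_parser():
    """Create a parser that accepts the --override_default_config flag."""
    parser = argparse.ArgumentParser(description="Synthetic data generation pipeline")
    parser.add_argument(
        "--override_default_config",
        default=None,
        metavar="PATH",
        help="path to a custom config file used instead of the default",
    )
    return parser


def resolve_config_path(argv=None):
    """Return the custom config path if given, otherwise the default path."""
    args = build_parser().parse_args(argv)
    return args.override_default_config or DEFAULT_CONFIG_PATH
```

Keeping one config file per experiment makes each run reproducible from its command line alone.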

License

This project is licensed under the MIT License - see the LICENSE file for details.

If you find this tool useful, please consider giving the repository a star 🌟.


nhungnt7/OpenSynth
