Synthetic data

Synthetic data generation is not a new technique in AI. Early methods relied on statistical techniques such as bootstrapping, smoothing, and imputation. As machine learning matured, more sophisticated generative methods emerged in the 2010s, notably Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). More recently, LLMs have pushed the field further by making it practical to generate training data such as instructions, text, and code directly from prompts.

This GitHub repository (https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data) is a true gem. It offers a great selection of resources, including method surveys, blog posts, and papers to read when working on specific use cases such as math reasoning, code generation, and vision and language.

Methods

InstructLab by IBM: https://research.ibm.com/blog/LLM-generated-data

Hugging Face synthetic datasets: https://huggingface.co/blog/davanstrien/self-instruct

Datasets

Repository with synthetic datasets: https://github.com/davanstrien/awesome-synthetic-datasets

Cosmopedia and synthetic datasets: https://huggingface.co/blog/cosmopedia

YouTube video on StarCoder and StarCoder2: https://www.youtube.com/watch?v=IyI8pXbQzbw