Releases: argilla-io/distilabel
1.2.4
What's Changed
- Update `InferenceEndpointsLLM` to use `chat_completion` method by @gabrielmbmb in #815
Full Changelog: 1.2.3...1.2.4
1.2.3
What's Changed
- Fix Import Error for KeepColumns in instruction_backtranslation.md (Issue #785) by @Hassaan-Qaisar in #786
- Correct variable name in dataset push example (in ultrafeedback.md file) (Issue #787) by @Hassaan-Qaisar in #791
- docs: update script for issue dashboard by @sdiazlor in #775
- Fix 404 model not found for private Serverless IE by @dvsrepo in #806
New Contributors
- @Hassaan-Qaisar made their first contribution in #786
Full Changelog: 1.2.2...1.2.3
1.2.2
What's Changed
- Fix passing `input` to `format_output` function by @gabrielmbmb in #781
Full Changelog: 1.2.1...1.2.2
1.2.1
What's Changed
- Fix docs for distiset.save_to_disk kwargs by @fpreiss in #745
- docs: change references by @sdiazlor in #754
- Fix `response_format` for `TogetherLLM` and `AnyScaleLLM` by @gabrielmbmb in #764
Full Changelog: 1.2.0...1.2.1
1.2.0
✨ Release highlights
Structured generation with `instructor`, `InferenceEndpointsLLM` now supports structured generation and `StructuredGeneration` task
- `instructor` has been integrated, bringing support for structured generation with `OpenAILLM`, `AnthropicLLM`, `LiteLLM`, `MistralLLM`, `CohereLLM` and `GroqLLM`:
Structured generation with `instructor` example
from typing import List

from distilabel.llms import MistralLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int
    label: str
    color: str


class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"


class KnowledgeGraph(BaseModel):
    nodes: List[Node] = Field(default_factory=list)
    edges: List[Edge] = Field(default_factory=list)


with Pipeline(
    name="Knowledge-Graphs",
    description=(
        "Generate knowledge graphs to answer questions, this type of dataset can be used to "
        "steer a model to answer questions with a knowledge graph."
    ),
) as pipeline:
    sample_questions = [
        "Teach me about quantum mechanics",
        "Who is who in The Simpsons family?",
        "Tell me about the evolution of programming languages",
    ]

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": "You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.",
                "instruction": f"{question}",
            }
            for question in sample_questions
        ],
    )

    text_generation = TextGeneration(
        name="knowledge_graph_generation",
        llm=MistralLLM(
            model="open-mixtral-8x22b", structured_output={"schema": KnowledgeGraph}
        ),
    )

    load_dataset >> text_generation
- `InferenceEndpointsLLM` now supports structured generation.
- New `StructuredGeneration` task that allows defining the schema of the structured generation per input row (a rough sketch follows below).
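As a rough, illustrative sketch (not part of the release notes), the structured generation support in `InferenceEndpointsLLM` would presumably be driven through the same `structured_output` argument used with `MistralLLM` in the example above; the pydantic schema, the dictionary format and the `model_id` here are placeholder assumptions, so check the distilabel docs for the exact usage:
from distilabel.llms import InferenceEndpointsLLM
from pydantic import BaseModel


class RPGCharacter(BaseModel):  # hypothetical schema, for illustration only
    name: str
    weapon: str


# Assumption: `structured_output` accepts the same kind of dictionary as the other
# LLMs shown in this release ({"format": "json", "schema": <pydantic model>}).
llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3-70B-Instruct",
    structured_output={"format": "json", "schema": RPGCharacter},
)
llm.load()
output = llm.generate(
    inputs=[[{"role": "user", "content": "Give me a character description for a dwarf."}]]
)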
New tasks for generating datasets for training embedding models
`sentence-transformers` v3 was recently released, and we couldn't resist the urge to add a few new tasks that allow creating datasets for training embedding models!
- New `GenerateSentencePair` task that allows generating a `positive` sentence for an input `anchor`, and optionally also a `negative` sentence. The task allows creating different kinds of data by specifying the `action` to perform with respect to the `anchor`: paraphrasing, generating a semantically similar sentence, generating a query, or generating an answer (a rough usage sketch follows this list).
- Implemented Improving Text Embeddings with Large Language Models, adding the following tasks derived from the paper:
  - `EmbeddingTaskGenerator`, which allows generating new embedding-related tasks using an `LLM`.
  - `GenerateTextRetrievalData`, which allows creating text retrieval data with an `LLM`.
  - `GenerateShortTextMatchingData`, which allows creating short texts matching the input data.
  - `GenerateLongTextMatchingData`, which allows creating long texts matching the input data.
  - `GenerateTextClassificationData`, which allows creating text classification data from the input data.
  - `MonolingualTripletGenerator`, which allows creating monolingual triplets from the input data.
  - `BitextRetrievalGenerator`, which allows creating bitext retrieval data from the input data.
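A hedged sketch of `GenerateSentencePair` (not taken from the release notes): the `action` and `triplet` parameter names below reflect the description above but should be verified against the task's documentation, and the model id is a placeholder.
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import GenerateSentencePair

# Assumption: `action` controls how the generated `positive` relates to the `anchor`
# (here, a query), and `triplet=True` also generates a `negative` sentence.
generate_pairs = GenerateSentencePair(
    action="query",
    triplet=True,
    llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3-70B-Instruct"),
)
generate_pairs.load()
result = next(
    generate_pairs.process([{"anchor": "Shallow lakes mix fully several times a year."}])
)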
New `Step`s for loading data from different sources and saving/loading `Distiset` to disk
We've added a few new steps that allow loading data from different sources:
- `LoadDataFromDisk` allows loading a `Distiset` or `datasets.Dataset` that was previously saved using the `save_to_disk` method.
- `LoadDataFromFileSystem` allows loading a `datasets.Dataset` from a file system.
Thanks to @rasdani for helping us test these new tasks!
In addition, we have added a `save_to_disk` method to `Distiset`, akin to `datasets.Dataset.save_to_disk`, that allows saving the generated distiset to disk along with the `pipeline.yaml` and `pipeline.log`.
`save_to_disk` example
from distilabel.pipeline import Pipeline
with Pipeline(name="my-pipeline") as pipeline:
...
if __name__ == "__main__":
distiset = pipeline.run(...)
distiset.save_to_disk(dataset_path="my-distiset")
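A minimal sketch (not from the release notes) of how the saved distiset could be loaded back in a new pipeline with the `LoadDataFromDisk` step; the `dataset_path` parameter name mirrors `save_to_disk` above but is an assumption, so verify it against the step's docs.
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDisk

with Pipeline(name="my-second-pipeline") as pipeline:
    # Assumption: `dataset_path` points at the directory written by
    # `distiset.save_to_disk(...)` in the previous example.
    load_saved = LoadDataFromDisk(dataset_path="my-distiset")
    ...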
`MixtureOfAgentsLLM` implementation
We've added a new `LLM` called `MixtureOfAgentsLLM`, derived from the paper Mixture-of-Agents Enhances Large Language Model Capabilities. This new `LLM` allows generating improved outputs thanks to the collective expertise of several `LLM`s.
`MixtureOfAgentsLLM` example
from distilabel.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM
llm = MixtureOfAgentsLLM(
    aggregator_llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    proposers_llms=[
        InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
        InferenceEndpointsLLM(
            model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
            tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        ),
        InferenceEndpointsLLM(
            model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
            tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
        ),
    ],
    rounds=2,
)

llm.load()

output = llm.generate(
    inputs=[
        [
            {
                "role": "user",
                "content": "My favorite witty review of The Rings of Power series is this: Input:",
            }
        ]
    ]
)
Optimizations for saving the cache and passing batches to `GlobalStep`s
- The cache logic of the `_BatchManager` has been improved to incrementally update the cache, making the process much faster.
- The data of the input batches of the `GlobalStep`s will be passed to the step using the file system, as this is faster than passing it using the queue. This is possible thanks to the new integration of `fsspec`, which can be configured to use a file system or cloud storage as the backend for passing the data of the batches (a rough sketch follows this list).
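A hedged sketch (not from the release notes) of what configuring that backend might look like when running a pipeline; the `use_fs_to_pass_data` and `storage_parameters` argument names are assumptions to verify against the `Pipeline.run` documentation.
from distilabel.pipeline import Pipeline

with Pipeline(name="fs-batches-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    # Assumption: `storage_parameters` is forwarded to fsspec, so a local path, an S3
    # bucket, a GCS bucket, etc. could serve as the backend for passing batch data.
    distiset = pipeline.run(
        use_fs_to_pass_data=True,
        storage_parameters={"path": "s3://my-bucket/distilabel-batches"},
    )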
`BasePipeline` and `_BatchManager` refactor
The logic around `BasePipeline` and `_BatchManager` has been refactored, which will make it easier to implement new pipelines in the future.
Added `ArenaHard` as an example of how to use `distilabel` to implement a benchmark
`distilabel` can be easily used to create an `LLM` benchmark. To showcase this, we decided to implement Arena Hard as an example: Benchmarking with `distilabel`: Arena Hard.
📚 Improved documentation structure
We have updated the documentation structure to make it more clear and self-explanatory, as well as more visually appealing 😏.
What's Changed
- Add `prometheus.md` by @alvarobartt in #656
- Reduce time required to execute `_cache` method by @gabrielmbmb in #672
- [DOCS] Update theme styles and images by @leiyre in #667
- Fix circular import due to DISTILABEL_METADATA_KEY by @plaguss in #675
- Add `CITATION.cff` by @alvarobartt in #677
- Deprecate conversation support in `TextGeneration` in favour of `ChatGeneration` by @alvarobartt in #676
- Add functionality to load/save distisets to/from disk by @plaguss in #673
- Integration instructor by @plaguss in #654
- Fix docs of saving/loading distiset from disk by @plaguss in https://githu...
1.1.1
1.1.0
Distilabel 1.1.0
Two new tasks implemented!
`Genstruct` task (#600)
You can now use the `Genstruct` task, as described in https://huggingface.co/NousResearch/Genstruct-7B, to generate synthetic instruction fine-tuning datasets from a raw document:
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromDicts
from distilabel.steps.tasks import Genstruct
with Pipeline(name="harry-potter-genstruct") as pipeline:
load_hub_dataset = LoadDataFromDicts(
name="load_dataset",
data=[
{
"title": "Harry Potter and the Sorcerer's Stone",
"content": "An orphaned boy enrolls in a school of wizardry, where he learns the truth about himself, his family and the terrible evil that haunts the magical world.",
},
{
"title": "Harry Potter and the Chamber of Secrets",
"content": "Harry Potter lives his second year at Hogwarts with Ron and Hermione when a message on the wall announces that the legendary Chamber of Secrets has been opened. The trio soon realize that, to save the school, it will take a lot of courage.",
},
],
)
task = Genstruct(
name="task",
llm=TransformersLLM(
model="NousResearch/Genstruct-7B",
torch_dtype="float16",
chat_template="{{ messages[0]['content'] }}",
device="cuda:0",
),
num_generations=2,
group_generations=False,
output_mappings={"model_name": "model"},
)
`PrometheusEval` task (#610)
A new `PrometheusEval` task, based on the recently published paper "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models":
from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import PrometheusEval

with Pipeline(name="prometheus") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        repo_id="HuggingFaceH4/instruction-dataset",
        split="test",
        output_mappings={"prompt": "instruction", "completion": "generation"},
    )

    task = PrometheusEval(
        name="task",
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="absolute",
        rubric="factual-validity",
        reference=False,
        num_generations=1,
        group_generations=False,
    )

    load_dataset >> task
Connect the steps in the pipeline with `>>` (#490)
Now you can connect your steps using the right-shift operator (`>>`) in Python:
from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps.generators.huggingface import LoadHubDataset
from distilabel.steps.tasks.evol_instruct.base import EvolInstruct
from distilabel.steps.combine import CombineColumns

with Pipeline(name="Pipe name") as pipeline:
    load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)

    evol_instruction_complexity_1 = EvolInstruct(
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )
    evol_instruction_complexity_2 = EvolInstruct(
        llm=InferenceEndpointsLLM(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"),
    )

    combine_columns = CombineColumns(
        columns=["response"],
        output_columns=["candidates"],
    )

    (
        load_hub_dataset
        >> [evol_instruction_complexity_1, evol_instruction_complexity_2]
        >> combine_columns
    )
Routing batch function (#595)
Thanks to the new `routing_batch_function`, each batch of an upstream step can be routed conditionally to a list of specific downstream steps. In addition, we have included a `sample_n_steps` routing batch function, making it easier to replicate the definition of the original UltraFeedback paper:
import random
from distilabel.llms import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import TextGeneration
@routing_batch_function()
def sample_two_steps(steps: list[str]) -> list[str]:
    return random.sample(steps, 2)


with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    tasks = []
    for llm in (
        OpenAILLM(model="gpt-4-0125-preview"),
        MistralLLM(model="mistral-large-2402"),
        VertexAILLM(model="gemini-1.0-pro"),
    ):
        tasks.append(
            TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
        )

    combine_generations = CombineColumns(
        name="combine_generations",
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    load_dataset >> sample_two_steps >> tasks >> combine_generations
Generate structured outputs using `outlines` (#601)
You can generate `JSON` or `regex` using `TransformersLLM`, `LlamaCppLLM` or `vLLM`, thanks to the integration with [outlines](https://github.com/outlines-dev/outlines):
from enum import Enum
from distilabel.llms import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, StringConstraints, conint
from typing_extensions import Annotated
class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"
    mithril = "mithril"


class Character(BaseModel):
    name: Annotated[str, StringConstraints(max_length=30)]
    age: conint(gt=1, lt=3000)
    armor: Armor
    weapon: Weapon


with Pipeline("RPG-characters") as pipeline:
    system_prompt = (
        "You are a leading role play gamer. You have seen thousands of different characters and their attributes."
        " Please return a JSON object with common attributes of an RPG character."
    )

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": system_prompt,
                "instruction": f"Give me a character description for a {char}",
            }
            for char in ["dwarf", "elf", "human", "ork"]
        ],
    )

    text_generation = TextGeneration(
        name="text_generation_rpg",
        llm=LlamaCppLLM(
            model_path="model/path",  # type: ignore
            structured_output={"format": "json", "schema": Character},
        ),
    )

    load_dataset >> text_generation
New `GroqLLM` (#583)
New integration with Groq; special mention to @kcentric, who did the initial work prior to the refactor for 1.0.0.
from distilabel.llms.groq import GroqLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration
with Pipeline(name="text-generation-groq") as pipeline:
...
text_generation_with_groq = TextGeneration(
llm=GroqLLM(model="llama3-70b-8192"),
)
...
Easily test your pipeline by doing a `dry_run` (#635)
with Pipeline(...) as pipeline:
    ...

distiset = pipeline.dry_run(
    parameters=...,  # The same argument as `Pipeline.run`
    batch_size=1,  # Optional, will be set to 1 by default.
)
[05/13/24 16:22:30] INFO ['distilabel.pipeline.local'] 🌵 Dry run mode local.py:103
INFO ['distilabel.pipeline.local'] 📝 Pipeline data will be ... local.py:125
`pipeline.log` file is dumped to the Hugging Face repository (#568)
From now on, when you call `distiset.push_to_hub`, the `pipeline.log` file will be automatically dumped to your dataset repository along with the `pipeline.yaml` to keep track of the execution.
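For context, a minimal sketch of the call that triggers this behaviour, assuming `distiset` is the object returned by `pipeline.run(...)`; the repository id is a placeholder and any extra arguments are omitted.
# Assumption: the dataset repository id is the first argument, as in `datasets`;
# after this release, pipeline.yaml and pipeline.log are uploaded alongside the data.
distiset.push_to_hub("my-username/my-distiset")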
New `distilabel_metadata` column to store internal data (#586)
You can now optionally enable the addition of a metadata column. This column can store other things in the future, but for the moment it can be really handy to keep the raw output from an LLM and, in case it does some post-processing via `format_output`, keep the original output to avoid losing anything.
You can include the metadata at the task level as:
TextGeneration(..., add_raw_output=True|False)
And directly determine whether you want this column in your final `Distiset`:
with Pipeline(..., enable_metadata=True|False):
    ...
This way we can decide to remove the column altogether.
All the changes in this PR
- Allow nested connect calls and overload rshift method to connect steps by @plaguss in #490
- Fix `llm_blender` installation by @alvarobartt in #557
- Warn user a...
1.0.3
What's Changed
- Add `stop` and `stop_sequences` in `LLM.generate` subclasses by @alvarobartt in #585
Full Changelog: 1.0.2...1.0.3
1.0.2
What's Changed
- Fix `RuntimeParamater` validation when provided as `_Step` attr by @alvarobartt in #564
- Add `seed` with `random.randint` to ensure cache is not used by @alvarobartt in #571
Full Changelog: 1.0.1...1.0.2
1.0.1
What's Changed
- Fix typo in readme and remove the ToArgilla step by @dvsrepo in #548
- Fix `model_validator` in `InferenceEndpoints` due to `Pipeline` pickling by @alvarobartt in #552
Full Changelog: 1.0.0...1.0.1