Commit
Merge branch 'main' into dalle-image-gen
Showing 22 changed files with 417 additions and 24 deletions.
118 changes: 118 additions & 0 deletions
haystack/components/rankers/meta_field_grouping_ranker.py
@@ -0,0 +1,118 @@
# SPDX-FileCopyrightText: 2022-present deepset GmbH <[email protected]>
#
# SPDX-License-Identifier: Apache-2.0

from collections import defaultdict
from typing import Any, Dict, List, Optional, cast

from haystack import Document, component, logging

logger = logging.getLogger(__name__)


@component
class MetaFieldGroupingRanker:
    """
    Reorders the documents by grouping them based on metadata keys.
    The MetaFieldGroupingRanker can group documents by a primary metadata key `group_by`, and subgroup them with an optional
    secondary key, `subgroup_by`.
    Within each group or subgroup, it can also sort documents by a metadata key `sort_docs_by`.
    The output is a flat list of documents ordered by `group_by` and `subgroup_by` values.
    Any documents without a group are placed at the end of the list.
    The proper organization of documents helps improve the efficiency and performance of subsequent processing by an LLM.
    ### Usage example
    ```python
    from haystack.components.rankers import MetaFieldGroupingRanker
    from haystack.dataclasses import Document
    docs = [
        Document(content="Javascript is a popular programming language", meta={"group": "42", "split_id": 7, "subgroup": "subB"}),
        Document(content="Python is a popular programming language", meta={"group": "42", "split_id": 4, "subgroup": "subB"}),
        Document(content="A chromosome is a package of DNA", meta={"group": "314", "split_id": 2, "subgroup": "subC"}),
        Document(content="An octopus has three hearts", meta={"group": "11", "split_id": 2, "subgroup": "subD"}),
        Document(content="Java is a popular programming language", meta={"group": "42", "split_id": 3, "subgroup": "subB"})
    ]
    ranker = MetaFieldGroupingRanker(group_by="group", subgroup_by="subgroup", sort_docs_by="split_id")
    result = ranker.run(documents=docs)
    print(result["documents"])
    # [
    #     Document(id=d665bbc83e52c08c3d8275bccf4f22bf2bfee21c6e77d78794627637355b8ebc,
    #         content: 'Java is a popular programming language', meta: {'group': '42', 'split_id': 3, 'subgroup': 'subB'}),
    #     Document(id=a20b326f07382b3cbf2ce156092f7c93e8788df5d48f2986957dce2adb5fe3c2,
    #         content: 'Python is a popular programming language', meta: {'group': '42', 'split_id': 4, 'subgroup': 'subB'}),
    #     Document(id=ce12919795d22f6ca214d0f161cf870993889dcb146f3bb1b3e1ffdc95be960f,
    #         content: 'Javascript is a popular programming language', meta: {'group': '42', 'split_id': 7, 'subgroup': 'subB'}),
    #     Document(id=d9fc857046c904e5cf790b3969b971b1bbdb1b3037d50a20728fdbf82991aa94,
    #         content: 'A chromosome is a package of DNA', meta: {'group': '314', 'split_id': 2, 'subgroup': 'subC'}),
    #     Document(id=6d3b7bdc13d09aa01216471eb5fb0bfdc53c5f2f3e98ad125ff6b85d3106c9a3,
    #         content: 'An octopus has three hearts', meta: {'group': '11', 'split_id': 2, 'subgroup': 'subD'})
    # ]
    ```
    """  # noqa: E501

    def __init__(self, group_by: str, subgroup_by: Optional[str] = None, sort_docs_by: Optional[str] = None):
        """
        Creates an instance of MetaFieldGroupingRanker.
        :param group_by: The metadata key to aggregate the documents by.
        :param subgroup_by: The metadata key to aggregate the documents within a group that was created by the
            `group_by` key.
        :param sort_docs_by: Determines which metadata key is used to sort the documents. If not provided, the
            documents within the groups or subgroups are not sorted and are kept in the same order as
            they were inserted in the subgroups.
        """
        self.group_by = group_by
        self.sort_docs_by = sort_docs_by
        self.subgroup_by = subgroup_by

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> Dict[str, Any]:
        """
        Groups the provided list of documents based on the `group_by` parameter and optionally the `subgroup_by`.
        The output is a list of documents reordered based on how they were grouped.
        :param documents: The list of documents to group.
        :returns:
            A dictionary with the following keys:
            - documents: The list of documents ordered by the `group_by` and `subgroup_by` metadata values.
        """

        if not documents:
            return {"documents": []}

        document_groups: Dict[str, Dict[str, List[Document]]] = defaultdict(lambda: defaultdict(list))
        no_group_docs = []

        for doc in documents:
            group_value = str(doc.meta.get(self.group_by, ""))

            if group_value:
                subgroup_value = "no_subgroup"
                if self.subgroup_by and self.subgroup_by in doc.meta:
                    subgroup_value = doc.meta[self.subgroup_by]

                document_groups[group_value][subgroup_value].append(doc)
            else:
                no_group_docs.append(doc)

        ordered_docs = []
        for group in document_groups:
            for subgroup in document_groups[group]:
                docs = document_groups[group][subgroup]
                if self.sort_docs_by:
                    docs.sort(key=lambda d: d.meta.get(cast(str, self.sort_docs_by), float("inf")))
                ordered_docs.extend(docs)

        ordered_docs.extend(no_group_docs)

        return {"documents": ordered_docs}
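The grouping logic in `run` can be tried without installing Haystack. The sketch below is not the component itself: it applies the same steps to plain metadata dictionaries, and `group_and_order`, `records`, and the `name` field are hypothetical names introduced only for illustration. It shows that groups come out in first-seen order, that records missing the sort key land last within their group, and that records without a group go to the end.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional


def group_and_order(
    records: List[Dict[str, Any]],
    group_by: str,
    subgroup_by: Optional[str] = None,
    sort_by: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Reorder metadata dicts by group, then subgroup, following the steps in the diff."""
    groups: Dict[str, Dict[Any, List[Dict[str, Any]]]] = defaultdict(lambda: defaultdict(list))
    ungrouped: List[Dict[str, Any]] = []

    for meta in records:
        # A missing group key yields an empty string, which marks the record as ungrouped.
        group_value = str(meta.get(group_by, ""))
        if not group_value:
            ungrouped.append(meta)
            continue
        subgroup_value = "no_subgroup"
        if subgroup_by and subgroup_by in meta:
            subgroup_value = meta[subgroup_by]
        groups[group_value][subgroup_value].append(meta)

    ordered: List[Dict[str, Any]] = []
    # Dicts preserve insertion order, so groups appear in the order first seen.
    for subgroups in groups.values():
        for members in subgroups.values():
            if sort_by:
                # Records without the sort key get +inf and therefore sort last.
                members.sort(key=lambda m: m.get(sort_by, float("inf")))
            ordered.extend(members)
    ordered.extend(ungrouped)
    return ordered


records = [
    {"name": "js", "group": "42", "split_id": 7},
    {"name": "octopus", "split_id": 1},
    {"name": "dna", "group": "314", "split_id": 2},
    {"name": "java", "group": "42", "split_id": 3},
    {"name": "python", "group": "42"},
]
ordered_names = [r["name"] for r in group_and_order(records, "group", sort_by="split_id")]
print(ordered_names)  # ['java', 'js', 'python', 'dna', 'octopus']
```

One consequence of the `float("inf")` default is worth noting: if the sort key holds strings in some records and is missing in others, Python cannot compare `str` with `float` and the sort raises `TypeError`, so the default suits numeric sort keys best.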
7 changes: 7 additions & 0 deletions
releasenotes/notes/add-logs-empty-files-pdf-f28a14e52984c1ea.yaml
@@ -0,0 +1,7 @@
---
features:
  - |
    Add warning logs to the PDFMinerToDocument and PyPDFToDocument to indicate when a processed PDF file has no content.
    This can happen if the PDF file is a scanned image.
    Also added an explicit check and warning message to the DocumentSplitter that warns the user that empty Documents are skipped.
    This behavior was already occurring, but now it's clearer through logs that this is happening.
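The converter and splitter changes themselves are not shown in this part of the diff. As a rough illustration of the pattern the note describes (skip empty content, but say so in the logs), here is a minimal standard-library sketch. The function `keep_non_empty`, the logger name, and the message wording are assumptions and are not taken from Haystack.

```python
import logging
from typing import List

logger = logging.getLogger("empty_content_demo")  # hypothetical logger name


def keep_non_empty(texts: List[str], source: str) -> List[str]:
    """Return the texts that contain non-whitespace characters, warning about the rest."""
    kept = []
    for index, text in enumerate(texts):
        if text.strip():
            kept.append(text)
        else:
            # The skip was silent before; the warning makes it visible in the logs.
            logger.warning("Item %d from %s has no text content and is skipped.", index, source)
    return kept


kept = keep_non_empty(["first page", "", "   ", "last page"], source="scan.pdf")
print(kept)  # ['first page', 'last page']
```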
4 changes: 4 additions & 0 deletions
releasenotes/notes/add-metadata-grouper-21ec05fd4a307425.yaml
@@ -0,0 +1,4 @@
---
features:
  - |
    We have added a new MetaFieldGroupingRanker component that reorders documents by grouping them based on metadata keys. This can be useful for pre-processing Documents before feeding them to an LLM.
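One way such pre-processing might be used downstream is to turn the grouped order into labelled prompt sections. Because the ranker emits each group contiguously, `itertools.groupby` is enough for this. The sketch below works on plain `(group, content)` tuples rather than Haystack `Document` objects; `render_sections` and the section layout are illustrative assumptions, not part of the commit.

```python
from itertools import groupby
from typing import List, Tuple

# (group value, content) pairs, already ordered so that each group is contiguous,
# as the ranker's output would be.
ordered: List[Tuple[str, str]] = [
    ("42", "Java is a popular programming language"),
    ("42", "Python is a popular programming language"),
    ("314", "A chromosome is a package of DNA"),
]


def render_sections(pairs: List[Tuple[str, str]]) -> str:
    """Render one labelled section per contiguous run of the same group value."""
    sections = []
    for group, members in groupby(pairs, key=lambda pair: pair[0]):
        lines = [f"- {content}" for _, content in members]
        sections.append(f"Group {group}:\n" + "\n".join(lines))
    return "\n\n".join(sections)


prompt_context = render_sections(ordered)
print(prompt_context)
```

Note that `groupby` only merges adjacent items, so this relies on the input already being grouped; on unordered input the same group would appear as several sections.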
@@ -0,0 +1,6 @@
---
features:
  - |
    Add TTFT (Time-to-First-Token) support for OpenAI generators. This
    captures the time taken to generate the first token from the model and
    can be used to analyze the latency of the application.
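The generator changes are not included in this part of the diff, so the following is only a sketch of what measuring time to first token means, not the Haystack implementation. It times a simulated stream with `time.perf_counter`; `fake_token_stream` and `collect_with_ttft` are hypothetical helpers standing in for a real streaming response.

```python
import time
from typing import Iterable, Iterator, List, Optional, Tuple


def fake_token_stream(tokens: List[str], delay_s: float) -> Iterator[str]:
    """Stand-in for a streaming model response; sleeps before each token."""
    for token in tokens:
        time.sleep(delay_s)
        yield token


def collect_with_ttft(stream: Iterable[str]) -> Tuple[str, Optional[float]]:
    """Consume the stream and return (full text, seconds until the first token)."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    parts: List[str] = []
    for token in stream:
        if ttft is None:
            # Recorded once, when the first token arrives.
            ttft = time.perf_counter() - start
        parts.append(token)
    return "".join(parts), ttft


text, ttft = collect_with_ttft(fake_token_stream(["Hel", "lo", "!"], delay_s=0.01))
print(text, f"TTFT={ttft:.3f}s")
```

If the stream yields nothing, `ttft` stays `None`, which keeps "no tokens" distinguishable from "an instant first token".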