-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: adding metadata grouper component #8512
Conversation
Pull Request Test Coverage Report for Build 11799668570Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've found some possible improvements.
@dfokina can you review the docstrings?
Co-authored-by: Stefano Fiorucci <[email protected]>
About the naming: from a grouper, I would expect it to return groups of Documents; this returns an ordered flat list, instead. |
@davidsbatista Let's add it do the pydocs file too, pls |
@component.output_types(documents=List[Document]) | ||
def run(self, documents: List[Document]) -> Dict[str, Any]: | ||
""" | ||
Groups the provided list of documents based on the `group_by` parameter and optionally the `subgroup_by`. | ||
|
||
The output is a list of documents reordered based on how they were grouped. | ||
|
||
:param documents: The list of documents to group. | ||
:returns: | ||
A dictionary with the following keys: | ||
- documents: The list of documents ordered by the `group_by` and `subgroup_by` metadata values. | ||
""" | ||
|
||
if len(documents) == 0: | ||
return {"documents": []} | ||
|
||
# docs based on the 'group_by' value | ||
document_groups, no_group, ordered_keys = self._group_documents(documents) | ||
|
||
# further grouping of the document inside each group based on the 'subgroup_by' value | ||
document_subgroups, subgroup_ordered_keys = self._create_subgroups(document_groups) | ||
|
||
# sort the docs within the groups or subgroups if necessary | ||
result_docs = self._merge_and_sort( | ||
document_groups, document_subgroups, no_group, ordered_keys, subgroup_ordered_keys | ||
) | ||
|
||
return {"documents": result_docs} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you see problems in a simple implementation like this?
@component.output_types(documents=List[Document])
def run(self, documents: List[Document]) -> Dict[str, Any]:
...
if not documents:
return {"documents": []}
document_groups = defaultdict(lambda: defaultdict(list))
no_group_docs = []
for doc in documents:
group_value = str(doc.meta.get(self.group_by, ""))
if group_value:
subgroup_value = str(doc.meta.get(self.subgroup_by, "no_subgroup")) if self.subgroup_by else "no_subgroup"
document_groups[group_value][subgroup_value].append(doc)
else:
no_group_docs.append(doc)
ordered_docs = []
for group in document_groups:
for subgroup in document_groups[group]:
docs = document_groups[group][subgroup]
if self.sort_docs_by:
docs = sorted(docs, key=lambda d: d.meta.get(self.sort_docs_by, float("inf")))
ordered_docs.extend(docs)
ordered_docs.extend(no_group_docs)
return {"documents": ordered_docs}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ju-gu can you test whether this simpler implementation is suitable for your needs? The tests are satisfied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, this seems to work as well, 👍
Co-authored-by: Stefano Fiorucci <[email protected]>
Co-authored-by: Stefano Fiorucci <[email protected]>
Co-authored-by: Stefano Fiorucci <[email protected]>
Co-authored-by: Stefano Fiorucci <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks for the implementation
@component.output_types(documents=List[Document]) | ||
def run(self, documents: List[Document]) -> Dict[str, Any]: | ||
""" | ||
Groups the provided list of documents based on the `group_by` parameter and optionally the `subgroup_by`. | ||
|
||
The output is a list of documents reordered based on how they were grouped. | ||
|
||
:param documents: The list of documents to group. | ||
:returns: | ||
A dictionary with the following keys: | ||
- documents: The list of documents ordered by the `group_by` and `subgroup_by` metadata values. | ||
""" | ||
|
||
if len(documents) == 0: | ||
return {"documents": []} | ||
|
||
# docs based on the 'group_by' value | ||
document_groups, no_group, ordered_keys = self._group_documents(documents) | ||
|
||
# further grouping of the document inside each group based on the 'subgroup_by' value | ||
document_subgroups, subgroup_ordered_keys = self._create_subgroups(document_groups) | ||
|
||
# sort the docs within the groups or subgroups if necessary | ||
result_docs = self._merge_and_sort( | ||
document_groups, document_subgroups, no_group, ordered_keys, subgroup_ordered_keys | ||
) | ||
|
||
return {"documents": result_docs} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, this seems to work as well, 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As agreed with David,
I pushed some simplifications and more tests.
I'm going to merge this as soon as the tests pass.
Related Issues
Proposed Changes:
How did you test it?
Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.