feat: adding metadata grouper component #8512

Merged
merged 36 commits into from
Nov 12, 2024

Conversation

davidsbatista
Contributor

@davidsbatista davidsbatista commented Oct 31, 2024

Related Issues

Proposed Changes:

  • Adding a new component to group/rank documents based on specified metadata fields

How did you test it?

  • unit tests, integration tests, manual verification

Checklist

@github-actions github-actions bot added the topic:tests and type:documentation labels Oct 31, 2024
@coveralls
Collaborator

coveralls commented Nov 1, 2024

Pull Request Test Coverage Report for Build 11799668570

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.04%) to 90.177%

Totals Coverage Status
Change from base Build 11798792334: 0.04%
Covered Lines: 7776
Relevant Lines: 8623

💛 - Coveralls

@davidsbatista davidsbatista changed the title from "Add metadata grouper" to "feat: adding metadata grouper component" Nov 1, 2024
@davidsbatista davidsbatista marked this pull request as ready for review November 1, 2024 09:50
@davidsbatista davidsbatista requested review from a team as code owners November 1, 2024 09:50
@davidsbatista davidsbatista requested review from dfokina and anakin87 and removed request for a team November 1, 2024 09:50
@sjrl sjrl requested a review from ju-gu November 4, 2024 08:26
Member

@anakin87 anakin87 left a comment

I've found some possible improvements.

@dfokina can you review the docstrings?

test/components/rankers/test_metadata_grouper.py (outdated, resolved)
haystack/components/rankers/metadata_grouper.py (outdated, resolved)
haystack/components/rankers/metadata_grouper.py (outdated, resolved)
test/components/rankers/test_metadata_grouper.py (outdated, resolved)
@anakin87
Member

About the naming: from a grouper, I would expect it to return groups of Documents; this returns an ordered flat list, instead.
Any other ideas? @bilgeyucel
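
To make the naming concern concrete, here is a minimal sketch contrasting the two return shapes. It assumes plain dicts as stand-ins for Documents, and `group_docs` and `flatten_groups` are hypothetical helper names chosen for illustration, not part of Haystack. It also ignores the handling of documents that lack the grouping key.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical stand-in: a plain dict plays the role of a Document's metadata.
Doc = Dict[str, Any]


def group_docs(docs: List[Doc], key: str) -> Dict[str, List[Doc]]:
    """What the name 'grouper' suggests: a mapping from group value to its members."""
    groups: Dict[str, List[Doc]] = defaultdict(list)
    for doc in docs:
        groups[str(doc.get(key, ""))].append(doc)
    return dict(groups)


def flatten_groups(docs: List[Doc], key: str) -> List[Doc]:
    """The shape discussed in this PR: one flat list with group members adjacent."""
    return [doc for members in group_docs(docs, key).values() for doc in members]


docs = [
    {"id": 1, "chapter": "a"},
    {"id": 2, "chapter": "b"},
    {"id": 3, "chapter": "a"},
]
grouped = group_docs(docs, "chapter")    # dict keyed by chapter
flat = flatten_groups(docs, "chapter")   # reordered flat list
```

The flat list keeps groups in order of first appearance, which is closer to what a ranker does than to what a grouper's name implies.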

@dfokina
Contributor

dfokina commented Nov 11, 2024

@davidsbatista Let's add it to the pydocs file too, pls

Comment on lines 179 to 206
    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> Dict[str, Any]:
        """
        Groups the provided list of documents based on the `group_by` parameter and optionally the `subgroup_by`.

        The output is a list of documents reordered based on how they were grouped.

        :param documents: The list of documents to group.
        :returns:
            A dictionary with the following keys:
            - documents: The list of documents ordered by the `group_by` and `subgroup_by` metadata values.
        """

        if len(documents) == 0:
            return {"documents": []}

        # docs based on the 'group_by' value
        document_groups, no_group, ordered_keys = self._group_documents(documents)

        # further grouping of the document inside each group based on the 'subgroup_by' value
        document_subgroups, subgroup_ordered_keys = self._create_subgroups(document_groups)

        # sort the docs within the groups or subgroups if necessary
        result_docs = self._merge_and_sort(
            document_groups, document_subgroups, no_group, ordered_keys, subgroup_ordered_keys
        )

        return {"documents": result_docs}
Member

Do you see problems in a simple implementation like this?

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]) -> Dict[str, Any]:
        ...
        if not documents:
            return {"documents": []}

        document_groups = defaultdict(lambda: defaultdict(list))
        no_group_docs = []

        for doc in documents:
            group_value = str(doc.meta.get(self.group_by, ""))
            if group_value:
                subgroup_value = str(doc.meta.get(self.subgroup_by, "no_subgroup")) if self.subgroup_by else "no_subgroup"
                document_groups[group_value][subgroup_value].append(doc)
            else:
                no_group_docs.append(doc)

        ordered_docs = []
        for group in document_groups:
            for subgroup in document_groups[group]:
                docs = document_groups[group][subgroup]
                if self.sort_docs_by:
                    docs = sorted(docs, key=lambda d: d.meta.get(self.sort_docs_by, float("inf")))
                ordered_docs.extend(docs)

        ordered_docs.extend(no_group_docs)

        return {"documents": ordered_docs}
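
For readers who want to try the suggestion outside Haystack, the following is a minimal standalone sketch of the same logic. It assumes a small `Doc` dataclass as a hypothetical stand-in for Haystack's `Document` (only `content` and `meta`), and `group_and_order` is an illustrative function name rather than the merged component's API.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document: only content and meta."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def group_and_order(
    documents: List[Doc],
    group_by: str,
    subgroup_by: Optional[str] = None,
    sort_docs_by: Optional[str] = None,
) -> List[Doc]:
    """Reorder documents so members of the same group (and subgroup) are adjacent."""
    if not documents:
        return []

    # group value -> subgroup value -> documents, in order of first appearance
    groups = defaultdict(lambda: defaultdict(list))
    no_group: List[Doc] = []

    for doc in documents:
        group_value = str(doc.meta.get(group_by, ""))
        if group_value:
            subgroup_value = str(doc.meta.get(subgroup_by, "no_subgroup")) if subgroup_by else "no_subgroup"
            groups[group_value][subgroup_value].append(doc)
        else:
            no_group.append(doc)

    ordered: List[Doc] = []
    for subgroups in groups.values():
        for docs in subgroups.values():
            if sort_docs_by:
                # documents missing the sort key are placed last within their subgroup
                docs = sorted(docs, key=lambda d: d.meta.get(sort_docs_by, float("inf")))
            ordered.extend(docs)

    # documents without a group value go at the end, in their original order
    ordered.extend(no_group)
    return ordered


docs = [
    Doc("a2", {"group": "A", "split_id": 2}),
    Doc("b1", {"group": "B", "split_id": 1}),
    Doc("x", {}),
    Doc("a1", {"group": "A", "split_id": 1}),
]
result = [d.content for d in group_and_order(docs, "group", sort_docs_by="split_id")]
```

The sketch relies on dictionaries preserving insertion order, so groups come out in the order their first document appeared in the input.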

Member

@ju-gu can you test whether this simpler implementation is suitable for your needs? The tests are satisfied.

Member

nice, this seems to work as well, 👍
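
One way to probe the question asked in this thread is to check how the two key expressions from the suggested snippet behave on edge cases. The sketch below isolates them as hypothetical helper functions (`group_key` and `sort_key` are illustrative names) operating on plain metadata dicts; it shows standard Python behavior only and makes no claim about the merged implementation.

```python
from typing import Any, Dict


def group_key(meta: Dict[str, Any], group_by: str) -> str:
    """The grouping expression from the snippet: missing keys become an empty string."""
    return str(meta.get(group_by, ""))


def sort_key(meta: Dict[str, Any], sort_docs_by: str) -> Any:
    """The sorting expression from the snippet: missing keys sort after any number."""
    return meta.get(sort_docs_by, float("inf"))


# Documents without the sort field end up last within their group.
metas = [{"split_id": 3}, {}, {"split_id": 1}]
ordered = sorted(metas, key=lambda m: sort_key(m, "split_id"))

# A key that is absent yields "", which is falsy, so the document counts as ungrouped.
missing_key = group_key({}, "chapter")

# A key that is present but set to None yields the truthy string "None",
# so such documents would form their own group named "None".
none_key = group_key({"chapter": None}, "chapter")
```

Note also that the `float("inf")` default only compares cleanly against numeric values; if the sort field holds strings, a document missing that field would make `sorted` raise a `TypeError`.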

haystack/components/rankers/__init__.py (outdated, resolved)
haystack/components/rankers/meta_field_grouper_ranker.py (outdated, resolved)
test/components/rankers/test_metadata_grouper.py (outdated, resolved)
test/components/rankers/test_metadata_grouper.py (outdated, resolved)
Member

@ju-gu ju-gu left a comment

lgtm, thanks for the implementation

Member

@anakin87 anakin87 left a comment

As agreed with David,
I pushed some simplifications and more tests.
I'm going to merge this as soon as the tests pass.

@anakin87 anakin87 merged commit e5a8072 into main Nov 12, 2024
19 checks passed
@anakin87 anakin87 deleted the add-metadata-grouper branch November 12, 2024 15:01
Labels
topic:tests, type:documentation

5 participants