Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add JSONConverter Component #8397

Merged
merged 8 commits into from
Sep 25, 2024
Merged

feat: Add JSONConverter Component #8397

merged 8 commits into from
Sep 25, 2024

Conversation

silvanocerza
Copy link
Contributor

Related Issues

Proposed Changes:

Add a new JSONConverter Component.

Usage:

import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
            " slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields=["firstname", "surname"]
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'

print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
# 'for their discoveries of growth factors'

print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}

How did you test it?

I added unit tests.

Notes for the reviewer

N/A

Checklist

@silvanocerza silvanocerza self-assigned this Sep 24, 2024
@silvanocerza silvanocerza requested review from a team as code owners September 24, 2024 11:23
@silvanocerza silvanocerza requested review from dfokina and shadeMe and removed request for a team September 24, 2024 11:23
@coveralls
Copy link
Collaborator

coveralls commented Sep 24, 2024

Pull Request Test Coverage Report for Build 11030215143

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.06%) to 90.297%

Totals Coverage Status
Change from base Build 11011481624: -0.06%
Covered Lines: 7408
Relevant Lines: 8204

💛 - Coveralls

self,
jq_schema: Optional[str] = None,
content_key: Optional[str] = None,
extra_meta_fields: Optional[List[str]] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
extra_meta_fields: Optional[List[str]] = None,
extra_meta_fields: Optional[Set[str]] = None,

"""
Utility function to extract text and metadata from a JSON file.
"""
file_content = source.data.decode("utf-8")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's wrap this in exception handling and handle decoding errors - malformed inputs are a common enough occurrence that it makes sense to rethrow/warn with an expanded error message (w/t the file path).

haystack/components/converters/json.py Outdated Show resolved Hide resolved
haystack/components/converters/json.py Show resolved Hide resolved
haystack/components/converters/json.py Show resolved Hide resolved
haystack/components/converters/json.py Outdated Show resolved Hide resolved
haystack/components/converters/json.py Outdated Show resolved Hide resolved
haystack/components/converters/json.py Show resolved Hide resolved
test/components/converters/test_json.py Show resolved Hide resolved
@component.output_types(documents=List[Document])
def run(
self,
sources: List[Union[str, Path, ByteStream]],
Copy link
Contributor Author

@silvanocerza silvanocerza Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjrl the strings in here are meant to be paths like other converters. If there needs be in the future we can treat them as JSON strings too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know! I don't think there is any need for us to work directly with JSON strings right now. File paths is fine.

@dfokina
Copy link
Contributor

dfokina commented Sep 24, 2024

Pls add this to pydocs too 🙏

@sjrl
Copy link
Contributor

sjrl commented Sep 25, 2024

Looks great @silvanocerza! My main wish was that this component could also serve as basically a Document.from_dict call and it looks like it can handle that. To give context, we have cases where we create Haystack Documents locally and then want to load them directly in dC as JSON files.

I think the only request/question I have is it possible to set one content key and then say all other fields should automatically be treated as metadata fields? We often have the case that data is messy (go figure) and not all files we are ingesting have the same metadata keys (but the same content key) and we'd want to make sure we grab all of them and needing to manually specify could be cumbersome.

@dfokina
Copy link
Contributor

dfokina commented Sep 25, 2024

Going to push some docstring updates real quick 👀

@silvanocerza silvanocerza merged commit 29672d4 into main Sep 25, 2024
21 checks passed
@silvanocerza silvanocerza deleted the json-converter branch September 25, 2024 10:34
@dfokina dfokina mentioned this pull request Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JSON Convertor : Pipeline Component
5 participants