-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separate out indexing-time image analysis into new phase #4228
base: main
Are you sure you want to change the base?
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Summary
This PR introduces significant changes to separate indexing-time image analysis into a new process across the codebase. Here's a summary of the key changes and potential issues:
- Introduces specialized
TextSection
andImageSection
models to replace the genericSection
model, providing clearer separation between text and image content - Removes image summarization during indexing, deferring it to a new
process_image_sections
step in the indexing pipeline - Removes
VisionEnabledConnector
mixin and LLM dependencies from connectors that handle images - Modifies all connectors to use
TextSection
for text content consistently
Key points to address:
- The
models.py
file has a commented-out duplicate definition ofTextSection
that should be removed ImageSection
inherits fromTextSection
but makesimage_file_name
required, which could be confusing - consider making them separate classes- The change to
DocumentBase.sections
type is a breaking change that needs careful review across all implementations - Consider adding validation to ensure
TextSection
andImageSection
are used appropriately in each context - Add documentation for the new image processing pipeline to help users understand the changes
The changes improve code organization but require careful review to ensure consistent implementation across all connectors.
51 file(s) reviewed, 14 comment(s)
Edit PR Review Bot Settings | Greptile
) -> tuple[list[TextSection], dict[str, str | list[str]]]: | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: The docstring still references 'list of Sections' in the return type description but the actual return type is now list[TextSection]
) -> tuple[list[TextSection], dict[str, str | list[str]]]: | |
""" | |
) -> tuple[list[TextSection], dict[str, str | list[str]]]: | |
"""Process a single Airtable field and return text sections or metadata. | |
Args: | |
field_name: Name of the field | |
field_info: Raw field information from Airtable | |
field_type: Airtable field type | |
Returns: | |
(list of TextSections, dict of metadata) | |
""" |
id=af.doc_id, | ||
sections=[Section(link=af.link, text=reply) for reply in af.responses], | ||
sections=[TextSection(link=af.link, text=reply) for reply in af.responses], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Consider adding initial_content to the sections list before responses, as it's currently being dropped
id=af.doc_id, | |
sections=[Section(link=af.link, text=reply) for reply in af.responses], | |
sections=[TextSection(link=af.link, text=reply) for reply in af.responses], | |
id=af.doc_id, | |
sections=[TextSection(link=af.link, text=af.initial_content)] + [TextSection(link=af.link, text=reply) for reply in af.responses], |
@@ -167,7 +167,7 @@ def _page_to_document( | |||
time.sleep(0.1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Consider removing or configuring this hardcoded sleep delay. Use rate limiting or configurable delays instead
if pdf_metadata: | ||
metadata.update(pdf_metadata) | ||
|
||
# TODO: Handle embedded images in PDFs if needed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: TODO comment left in production code about handling embedded images in PDFs
# Get file extension and determine file type | ||
file_extension = Path(file_name).suffix.lower().lstrip(".") | ||
mime_type = metadata.get("mime_type", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: File extension check should use get_file_ext() for consistency with rest of codebase, not Path.suffix
# Get file extension and determine file type | |
file_extension = Path(file_name).suffix.lower().lstrip(".") | |
mime_type = metadata.get("mime_type", "") | |
# Get file extension and determine file type | |
file_extension = get_file_ext(file_name).lstrip(".") | |
mime_type = metadata.get("mime_type", "") |
backend/onyx/connectors/models.py
Outdated
class Section(BaseModel): | ||
text: str | ||
link: str | None = None | ||
image_file_name: str | None = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Section class is now redundant with TextSection and should be removed or deprecated with a migration plan
backend/onyx/connectors/models.py
Outdated
class ImageSection(TextSection): | ||
image_file_name: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Consider making ImageSection a standalone class instead of inheriting from TextSection to avoid confusion with optional vs required image_file_name
summary_text = ( | ||
summarize_image_with_error_handling(llm, image_data, display_name) or "" | ||
) | ||
return ImageSection(text="", image_file_name=""), None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Empty image_file_name in error case could cause issues downstream if code expects non-empty image_file_name for ImageSection objects
return ImageSection(text="", image_file_name=""), None | |
return ImageSection(text="", image_file_name=None), None |
# Get the image data from PGFileStore | ||
try: | ||
with get_session_with_current_tenant() as db_session: | ||
pgfilestore = get_pgfilestore_by_file_name( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Database session should be created outside the loop to avoid repeatedly opening/closing sessions for each image in a document
# Get the image data | ||
pgfilestore_data = read_lobj( | ||
pgfilestore.lobj_oid, db_session | ||
).read() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: read_lobj() returns a file-like object that should be properly closed after reading
# Get the image data | |
pgfilestore_data = read_lobj( | |
pgfilestore.lobj_oid, db_session | |
).read() | |
# Get the image data | |
with read_lobj(pgfilestore.lobj_oid, db_session) as lobj: | |
pgfilestore_data = lobj.read() |
106f1e4
to
39d98b2
Compare
c3ca9fc
to
a1ab2e4
Compare
8ddae60
to
5881ab5
Compare
Description
Refactor so that we can keep connectors more focused. Instead of doing image analysis within connectors and having connector-specific handling there, we simply yield back ImageSections / TextSections that are then converted to Sections (as part of the indexing pipeline)
Also some misc Drive improvements for downloading media
N.B. I can see an argument for doing this conversion earlier in the pipeline– open to thoughts
Fixes https://linear.app/danswer/issue/DAN-1536/image-search-enhancements
How Has This Been Tested?
Various image formats in:
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.