Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate out indexing-time image analysis into new phase #4228

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

pablonyx
Copy link
Contributor

@pablonyx pablonyx commented Mar 7, 2025

Description

Refactor so that we can keep connectors more focused. Instead of doing image analysis within connectors and having connector-specific handling there, we simply yield back ImageSections / TextSections that are then converted to Sections (as part of the indexing pipeline)

Also some misc Drive improvements for downloading media

N.B. I can see an argument for doing this conversion earlier in the pipeline– open to thoughts

Fixes https://linear.app/danswer/issue/DAN-1536/image-search-enhancements

How Has This Been Tested?

Various image formats in:

  • Drive
  • File
  • Confluence

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@pablonyx pablonyx requested a review from a team as a code owner March 7, 2025 01:41
Copy link

vercel bot commented Mar 7, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
internal-search ✅ Ready (Inspect) Visit Preview 💬 Add feedback Mar 8, 2025 11:25pm

@pablonyx pablonyx changed the title Separate out indexing-time image analysis into new pr Separate out indexing-time image analysis into new phase Mar 7, 2025
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

This PR introduces significant changes to separate indexing-time image analysis into a new process across the codebase. Here's a summary of the key changes and potential issues:

  • Introduces specialized TextSection and ImageSection models to replace the generic Section model, providing clearer separation between text and image content
  • Removes image summarization during indexing, deferring it to a new process_image_sections step in the indexing pipeline
  • Removes VisionEnabledConnector mixin and LLM dependencies from connectors that handle images
  • Modifies all connectors to use TextSection for text content consistently

Key points to address:

  • The models.py file has a commented-out duplicate definition of TextSection that should be removed
  • ImageSection inherits from TextSection but makes image_file_name required, which could be confusing - consider making them separate classes
  • The change to DocumentBase.sections type is a breaking change that needs careful review across all implementations
  • Consider adding validation to ensure TextSection and ImageSection are used appropriately in each context
  • Add documentation for the new image processing pipeline to help users understand the changes

The changes improve code organization but require careful review to ensure consistent implementation across all connectors.

51 file(s) reviewed, 14 comment(s)
Edit PR Review Bot Settings | Greptile

Comment on lines +270 to 271
) -> tuple[list[TextSection], dict[str, str | list[str]]]:
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: The docstring still references 'list of Sections' in the return type description but the actual return type is now list[TextSection]

Suggested change
) -> tuple[list[TextSection], dict[str, str | list[str]]]:
"""
) -> tuple[list[TextSection], dict[str, str | list[str]]]:
"""Process a single Airtable field and return text sections or metadata.
Args:
field_name: Name of the field
field_info: Raw field information from Airtable
field_type: Airtable field type
Returns:
(list of TextSections, dict of metadata)
"""

Comment on lines 223 to +224
id=af.doc_id,
sections=[Section(link=af.link, text=reply) for reply in af.responses],
sections=[TextSection(link=af.link, text=reply) for reply in af.responses],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Consider adding initial_content to the sections list before responses, as it's currently being dropped

Suggested change
id=af.doc_id,
sections=[Section(link=af.link, text=reply) for reply in af.responses],
sections=[TextSection(link=af.link, text=reply) for reply in af.responses],
id=af.doc_id,
sections=[TextSection(link=af.link, text=af.initial_content)] + [TextSection(link=af.link, text=reply) for reply in af.responses],

@@ -167,7 +167,7 @@ def _page_to_document(
time.sleep(0.1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Consider removing or configuring this hardcoded sleep delay. Use rate limiting or configurable delays instead

if pdf_metadata:
metadata.update(pdf_metadata)

# TODO: Handle embedded images in PDFs if needed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: TODO comment left in production code about handling embedded images in PDFs

Comment on lines 119 to 121
# Get file extension and determine file type
file_extension = Path(file_name).suffix.lower().lstrip(".")
mime_type = metadata.get("mime_type", "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: File extension check should use get_file_ext() for consistency with rest of codebase, not Path.suffix

Suggested change
# Get file extension and determine file type
file_extension = Path(file_name).suffix.lower().lstrip(".")
mime_type = metadata.get("mime_type", "")
# Get file extension and determine file type
file_extension = get_file_ext(file_name).lstrip(".")
mime_type = metadata.get("mime_type", "")

Comment on lines 33 to 38
class Section(BaseModel):
text: str
link: str | None = None
image_file_name: str | None = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Section class is now redundant with TextSection and should be removed or deprecated with a migration plan

Comment on lines 45 to 51
class ImageSection(TextSection):
image_file_name: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Consider making ImageSection a standalone class instead of inheriting from TextSection to avoid confusion with optional vs required image_file_name

summary_text = (
summarize_image_with_error_handling(llm, image_data, display_name) or ""
)
return ImageSection(text="", image_file_name=""), None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Empty image_file_name in error case could cause issues downstream if code expects non-empty image_file_name for ImageSection objects

Suggested change
return ImageSection(text="", image_file_name=""), None
return ImageSection(text="", image_file_name=None), None

Comment on lines 347 to 362
# Get the image data from PGFileStore
try:
with get_session_with_current_tenant() as db_session:
pgfilestore = get_pgfilestore_by_file_name(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Database session should be created outside the loop to avoid repeatedly opening/closing sessions for each image in a document

Comment on lines 367 to 390
# Get the image data
pgfilestore_data = read_lobj(
pgfilestore.lobj_oid, db_session
).read()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: read_lobj() returns a file-like object that should be properly closed after reading

Suggested change
# Get the image data
pgfilestore_data = read_lobj(
pgfilestore.lobj_oid, db_session
).read()
# Get the image data
with read_lobj(pgfilestore.lobj_oid, db_session) as lobj:
pgfilestore_data = lobj.read()

@pablonyx pablonyx force-pushed the new_image_processing_step branch from 8ddae60 to 5881ab5 Compare March 8, 2025 22:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant