Separate out indexing-time image analysis into new phase #4228

pablonyx · 2025-03-07T01:41:56Z

Description

Refactor so that we can keep connectors more focused. Instead of doing image analysis within connectors and having connector-specific handling there, connectors now simply yield ImageSections / TextSections, which are then converted to Sections as part of the indexing pipeline.
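For readers skimming the thread, the flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the dataclasses stand in for the real Pydantic models, `convert_sections` and `summarize_image` are hypothetical names, and the two section types are kept as separate classes purely to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class TextSection:
    # Plain text yielded by a connector
    text: str
    link: Optional[str] = None


@dataclass
class ImageSection:
    # Reference to a stored image; the connector makes no LLM call
    image_file_name: str
    link: Optional[str] = None


@dataclass
class Section:
    # Unified form consumed by the rest of the indexing pipeline
    text: str
    link: Optional[str] = None
    image_file_name: Optional[str] = None


def convert_sections(
    sections: list[Union[TextSection, ImageSection]],
    summarize_image: Callable[[str], str],
) -> list[Section]:
    """Convert connector output into Sections; image analysis happens here."""
    converted: list[Section] = []
    for section in sections:
        if isinstance(section, ImageSection):
            # Hypothetical summarizer: takes a stored file name, returns text
            summary = summarize_image(section.image_file_name)
            converted.append(
                Section(
                    text=summary,
                    link=section.link,
                    image_file_name=section.image_file_name,
                )
            )
        else:
            converted.append(Section(text=section.text, link=section.link))
    return converted
```

The point of the split is that connectors only need to know how to fetch and store content, while anything that needs an LLM lives in one pipeline step.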

Also some misc Drive improvements for downloading media

N.B. I can see an argument for doing this conversion earlier in the pipeline; open to thoughts.

Fixes https://linear.app/danswer/issue/DAN-1536/image-search-enhancements

How Has This Been Tested?

Various image formats in:

- Drive
- File
- Confluence

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

This PR should be backported (make sure to check that the backport attempt succeeds)
[Optional] Override Linear Check

vercel · 2025-03-07T01:42:01Z

The latest updates on your projects.

Name | Status | Updated (UTC)
internal-search | ✅ Ready | Mar 8, 2025 11:25pm

greptile-apps

PR Summary

This PR introduces significant changes to separate indexing-time image analysis into a new process across the codebase. Here's a summary of the key changes and potential issues:

- Introduces specialized TextSection and ImageSection models to replace the generic Section model, providing clearer separation between text and image content
- Removes image summarization during indexing, deferring it to a new process_image_sections step in the indexing pipeline
- Removes VisionEnabledConnector mixin and LLM dependencies from connectors that handle images
- Modifies all connectors to use TextSection for text content consistently

Key points to address:

- The models.py file has a commented-out duplicate definition of TextSection that should be removed
- ImageSection inherits from TextSection but makes image_file_name required, which could be confusing - consider making them separate classes
- The change to DocumentBase.sections type is a breaking change that needs careful review across all implementations
- Consider adding validation to ensure TextSection and ImageSection are used appropriately in each context
- Add documentation for the new image processing pipeline to help users understand the changes

The changes improve code organization but require careful review to ensure consistent implementation across all connectors.

51 file(s) reviewed, 14 comment(s)

greptile-apps · 2025-03-07T01:43:08Z

backend/onyx/connectors/airtable/airtable_connector.py

+    ) -> tuple[list[TextSection], dict[str, str | list[str]]]:
        """


style: The docstring still references 'list of Sections' in the return type description but the actual return type is now list[TextSection]

Suggested change

-    ) -> tuple[list[TextSection], dict[str, str | list[str]]]:
-        """
+    ) -> tuple[list[TextSection], dict[str, str | list[str]]]:
+        """Process a single Airtable field and return text sections or metadata.
+        Args:
+            field_name: Name of the field
+            field_info: Raw field information from Airtable
+            field_type: Airtable field type
+        Returns:
+            (list of TextSections, dict of metadata)
+        """

greptile-apps · 2025-03-07T01:43:22Z

backend/onyx/connectors/axero/connector.py

        id=af.doc_id,
-        sections=[Section(link=af.link, text=reply) for reply in af.responses],
+        sections=[TextSection(link=af.link, text=reply) for reply in af.responses],


logic: Consider adding initial_content to the sections list before responses, as it's currently being dropped

Suggested change

-        id=af.doc_id,
-        sections=[Section(link=af.link, text=reply) for reply in af.responses],
-        sections=[TextSection(link=af.link, text=reply) for reply in af.responses],
+        id=af.doc_id,
+        sections=[TextSection(link=af.link, text=af.initial_content)] + [TextSection(link=af.link, text=reply) for reply in af.responses],

greptile-apps · 2025-03-07T01:44:48Z

backend/onyx/connectors/bookstack/connector.py

@@ -167,7 +167,7 @@ def _page_to_document(
        time.sleep(0.1)


style: Consider removing or configuring this hardcoded sleep delay. Use rate limiting or configurable delays instead
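As a rough illustration of the "configurable delay" idea, here is a minimal sketch of a limiter that sleeps only for whatever part of the interval has not already elapsed. `MinIntervalLimiter` is a hypothetical helper written for this comment, not something from the codebase; the injectable `clock` and `sleep` arguments exist only so it can be exercised without real waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Keep successive calls at least `min_interval` seconds apart."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Sleep for the remainder of the interval; return seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        # Record when this call was allowed through
        self._last_call = self._clock()
        return slept
```

A connector could create one limiter from a config value and call `limiter.wait()` before each request, so slow requests are not penalized with an extra fixed 0.1 s.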

greptile-apps · 2025-03-07T01:46:51Z

backend/onyx/connectors/file/connector.py

+            if pdf_metadata:
+                metadata.update(pdf_metadata)
+
+            # TODO: Handle embedded images in PDFs if needed


style: TODO comment left in production code about handling embedded images in PDFs

greptile-apps · 2025-03-07T01:46:52Z

backend/onyx/connectors/file/connector.py

+    # Get file extension and determine file type
+    file_extension = Path(file_name).suffix.lower().lstrip(".")
+    mime_type = metadata.get("mime_type", "")


style: File extension check should use get_file_ext() for consistency with rest of codebase, not Path.suffix

Suggested change

-    # Get file extension and determine file type
-    file_extension = Path(file_name).suffix.lower().lstrip(".")
-    mime_type = metadata.get("mime_type", "")
+    # Get file extension and determine file type
+    file_extension = get_file_ext(file_name).lstrip(".")
+    mime_type = metadata.get("mime_type", "")

greptile-apps · 2025-03-07T01:50:15Z

backend/onyx/connectors/models.py

 class Section(BaseModel):
    text: str
    link: str | None = None
    image_file_name: str | None = None


logic: Section class is now redundant with TextSection and should be removed or deprecated with a migration plan

greptile-apps · 2025-03-07T01:50:15Z

backend/onyx/connectors/models.py

+class ImageSection(TextSection):
+    image_file_name: str


style: Consider making ImageSection a standalone class instead of inheriting from TextSection to avoid confusion with optional vs required image_file_name

greptile-apps · 2025-03-07T01:52:08Z

backend/onyx/file_processing/image_utils.py

-        summary_text = (
-            summarize_image_with_error_handling(llm, image_data, display_name) or ""
-        )
+        return ImageSection(text="", image_file_name=""), None


logic: Empty image_file_name in error case could cause issues downstream if code expects non-empty image_file_name for ImageSection objects

Suggested change

-        return ImageSection(text="", image_file_name=""), None
+        return ImageSection(text="", image_file_name=None), None
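Since the diff above declares image_file_name as a required `str`, passing `None` would presumably not validate either. An alternative worth considering is to reject empty names at construction and have the error path return no section at all. The sketch below only illustrates that idea under stated assumptions: it uses a dataclass rather than the real Pydantic model, and `build_image_section` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageSection:
    # Stand-in for the real model; image_file_name must be non-empty
    image_file_name: str
    link: Optional[str] = None

    def __post_init__(self) -> None:
        if not self.image_file_name:
            raise ValueError("ImageSection requires a non-empty image_file_name")


def build_image_section(
    stored_file_name: Optional[str], link: Optional[str] = None
) -> Optional[ImageSection]:
    """Return a section only when the image was actually stored."""
    if not stored_file_name:
        # Error path: signal "no section" instead of a placeholder object
        return None
    return ImageSection(image_file_name=stored_file_name, link=link)
```

Callers then filter out `None` results, and downstream code can rely on every ImageSection pointing at a real stored file.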

greptile-apps · 2025-03-07T01:52:21Z

backend/onyx/indexing/indexing_pipeline.py

+            # Get the image data from PGFileStore
+            try:
+                with get_session_with_current_tenant() as db_session:
+                    pgfilestore = get_pgfilestore_by_file_name(


style: Database session should be created outside the loop to avoid repeatedly opening/closing sessions for each image in a document

greptile-apps · 2025-03-07T01:52:22Z

backend/onyx/indexing/indexing_pipeline.py

+                    # Get the image data
+                    pgfilestore_data = read_lobj(
+                        pgfilestore.lobj_oid, db_session
+                    ).read()
+


logic: read_lobj() returns a file-like object that should be properly closed after reading

Suggested change

-                    # Get the image data
-                    pgfilestore_data = read_lobj(
-                        pgfilestore.lobj_oid, db_session
-                    ).read()
+                    # Get the image data
+                    with read_lobj(pgfilestore.lobj_oid, db_session) as lobj:
+                        pgfilestore_data = lobj.read()
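Note that the `with` form above only works if the object returned by read_lobj supports the context-manager protocol, which this thread does not show. If it only exposes `close()`, the standard library's `contextlib.closing` gives the same guarantee. A minimal sketch, where `FakeLargeObject` is a hypothetical stand-in for whatever read_lobj returns:

```python
import io
from contextlib import closing


class FakeLargeObject:
    """Stand-in with read() and close() but no __enter__/__exit__."""

    def __init__(self, data: bytes) -> None:
        self._buf = io.BytesIO(data)
        self.closed = False

    def read(self) -> bytes:
        return self._buf.read()

    def close(self) -> None:
        self._buf.close()
        self.closed = True


def read_all(obj) -> bytes:
    """Read the full contents and always close the handle afterwards."""
    with closing(obj) as handle:
        return handle.read()
```

`closing` calls `close()` even when `read()` raises, and it also works on objects such as `io.BytesIO` that already are context managers.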

pablonyx requested a review from a team as a code owner March 7, 2025 01:41

pablonyx changed the title ~~Separate out indexing-time image analysis into new pr~~ Separate out indexing-time image analysis into new phase Mar 7, 2025

vercel bot deployed to Preview March 7, 2025 01:44 View deployment

greptile-apps bot reviewed Mar 7, 2025

View reviewed changes

vercel bot deployed to Preview March 7, 2025 02:18 View deployment

vercel bot deployed to Preview March 7, 2025 02:46 View deployment

pablonyx force-pushed the new_image_processing_step branch from 106f1e4 to 39d98b2 Compare March 8, 2025 20:10

vercel bot deployed to Preview March 8, 2025 20:13 View deployment

vercel bot deployed to Preview March 8, 2025 21:33 View deployment

pablonyx force-pushed the new_image_processing_step branch from c3ca9fc to a1ab2e4 Compare March 8, 2025 22:44

pablonyx added 11 commits March 8, 2025 14:48

cd500b9 k
eecea40 smaller fix
a357c64 k
939ec26 k
40e2c42 k
b28b913 nit
ecccf31 half working state
8193607 k
3d0abee well functioning
cee0b8f nit
5881ab5 nit

pablonyx force-pushed the new_image_processing_step branch from 8ddae60 to 5881ab5 Compare March 8, 2025 22:49

pablonyx added 2 commits March 8, 2025 14:53

5800b18 final testing
49bc6ed remove unnecssary logs

vercel bot deployed to Preview March 8, 2025 23:00 View deployment

pablonyx added 2 commits March 8, 2025 15:03

91bf7a5 k
792961f updates

vercel bot deployed to Preview March 8, 2025 23:08 View deployment

c8c4fa9 k
1ba1a24 nit

vercel bot deployed to Preview March 8, 2025 23:21 View deployment

0576186 typing

vercel bot deployed to Preview March 8, 2025 23:25 View deployment
