Allow configuring HTML embedding #102

boswelja · 2024-12-15T02:07:35Z

Currently, the HTML embedder splits a webpage into sections for each paragraph, code block, header etc.

This works pretty great for articles and blog posts, where information is divided as such and each paragraph can be used independently. However, this falls apart for most documentation, as it loses the relationship between a paragraph explaining code and the code it's explaining.

It would be cool if this splitting behavior was configurable. I think the default behavior is fine for most cases, but options to split only on certain tokens, or to not split at all, would be great to have.

akshayballal95 · 2024-12-15T11:29:40Z

Yes, it makes sense. But most applications may require splitting, as people use it for RAG and the chunk size needs to be moderate. Do you have any ideas in mind for the splitting strategy? One option that I can think of is to convert the webpage to a Markdown-style format and chunk it like any other Markdown. Let me know if you have any other options.

boswelja · 2024-12-16T07:38:35Z

Yeah I think the current splitting strategy makes sense and works well for most sites, it's just the few where it doesn't 😅

Something that groups content under each heading would be ideal for the remaining few, I think. If we can convert to Markdown to keep the context of heading, paragraph, code etc., then that'd be even better!

akshayballal95 · 2024-12-16T15:58:14Z

Great, I will add some splitting strategy options that can be passed in the function signature.

boswelja · 2025-02-11T11:15:38Z

For splitting strategy options, what do you think about adding:

    /// Splits text-based content by paragraph, where each paragraph is its own embedding
    Paragraph,
    /// Splits text-based content into sections denoted by headers. Each header-denoted section
    /// becomes its own embedding.
    Section,

If it seems like a good idea, I can figure out the implementation. I just want to make sure I'm going about this the right way before starting.
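
For illustration, here is a minimal sketch of how the two proposed variants might behave. The enum name `SplittingStrategy`, the function `split_text`, and the helper `is_heading` are hypothetical and not taken from the project. The sketch also assumes the input is already Markdown-like text, with paragraphs separated by blank lines and sections opened by `#`-style headings.

```rust
/// Hypothetical enum name; the thread only proposes the two variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SplittingStrategy {
    /// Each paragraph becomes its own chunk.
    Paragraph,
    /// Each header-denoted section becomes its own chunk.
    Section,
}

/// Returns true for lines such as "# Title" or "### Details".
/// Naive: a "# comment" line inside a code block would also match.
fn is_heading(line: &str) -> bool {
    let hashes = line.chars().take_while(|c| *c == '#').count();
    (1..=6).contains(&hashes) && line[hashes..].starts_with(' ')
}

/// Splits Markdown-like text according to the chosen strategy (assumed input format).
pub fn split_text(text: &str, strategy: SplittingStrategy) -> Vec<String> {
    match strategy {
        SplittingStrategy::Paragraph => text
            .split("\n\n")
            .map(str::trim)
            .filter(|p| !p.is_empty())
            .map(String::from)
            .collect(),
        SplittingStrategy::Section => {
            let mut sections = Vec::new();
            let mut current = String::new();
            for line in text.lines() {
                // A new heading closes the section collected so far.
                if is_heading(line) && !current.trim().is_empty() {
                    sections.push(current.trim().to_string());
                    current.clear();
                }
                current.push_str(line);
                current.push('\n');
            }
            if !current.trim().is_empty() {
                sections.push(current.trim().to_string());
            }
            sections
        }
    }
}
```

With `Section`, a heading, its explanatory paragraph, and the code that follows stay in one chunk, which is the relationship the default per-paragraph splitting loses.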

akshayballal95 · 2025-02-12T14:05:31Z

If I understood this correctly, what you are suggesting is to find headers and get all the text under each header, and to split it further if it's larger than the max character limit. Is that right? I think a simple implementation would be to convert the HTML to Markdown using htmd and then use MarkdownSplitter to achieve this. How does that sound?
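
The "split it further if it's larger than the max character limit" step could look roughly like the sketch below. It is standard-library only and does not reproduce what htmd or MarkdownSplitter actually do; `chunk_section` and `max_chars` are hypothetical names. It assumes paragraphs within a section are separated by blank lines, packs them greedily, and hard-splits a single paragraph only when that paragraph alone exceeds the limit.

```rust
/// Hypothetical helper: splits one header-denoted section into chunks of
/// at most `max_chars` characters, preferring paragraph boundaries.
pub fn chunk_section(section: &str, max_chars: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be positive");
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();

    for para in section.split("\n\n").map(str::trim).filter(|p| !p.is_empty()) {
        let para_len = para.chars().count();
        // The +2 accounts for the blank line that joins two paragraphs.
        let joined_len = if current.is_empty() {
            para_len
        } else {
            current.chars().count() + 2 + para_len
        };

        if joined_len <= max_chars {
            if !current.is_empty() {
                current.push_str("\n\n");
            }
            current.push_str(para);
            continue;
        }

        // The paragraph does not fit: flush what has been collected so far.
        if !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
        }

        if para_len <= max_chars {
            current.push_str(para);
        } else {
            // Last resort: cut the oversized paragraph on character boundaries.
            let chars: Vec<char> = para.chars().collect();
            for piece in chars.chunks(max_chars) {
                chunks.push(piece.iter().collect());
            }
        }
    }

    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Note that the hard split can cut through a code block, so a real implementation would likely want a Markdown-aware splitter for that case.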

akshayballal95 · 2025-02-12T14:07:24Z

And if this is implemented, on the user side, we can give an option on how to split, based on tags or based on Markdown from top to bottom.

boswelja · 2025-02-12T23:10:40Z

Sounds good, I can start with converting HTML to Markdown and see where we go from there 👍

boswelja · 2025-02-13T08:16:25Z

My original plan to get started was to run the HTML through htmd and feed that back into MarkdownProcessor, but it seems like we just turn Markdown back into plaintext. Do you think it's worth retaining the formatting for MarkdownProcessor?
It probably does need to be stripped when generating embeddings, but the raw text would keep it 🤔
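
One way to picture that trade-off is a chunk that carries both forms: the raw Markdown for display, and a stripped version as embedding input. This is only a sketch under assumptions. `Chunk`, `strip_markdown`, and `make_chunk` are hypothetical names, not part of MarkdownProcessor, and the stripping is deliberately naive (heading markers, fence lines, bold markers, and backticks only).

```rust
/// Hypothetical chunk type: `raw` keeps the Markdown formatting,
/// `embed_input` is the plain text that would be sent to the embedding model.
#[derive(Debug, PartialEq)]
pub struct Chunk {
    pub raw: String,
    pub embed_input: String,
}

/// Naive Markdown stripping, for illustration only.
pub fn strip_markdown(md: &str) -> String {
    // Built at runtime so this example contains no literal fence marker.
    let fence = "`".repeat(3);
    md.lines()
        // Drop the fence lines themselves but keep the code between them.
        .filter(|line| !line.trim_start().starts_with(fence.as_str()))
        .map(|line| {
            // Remove leading '#' markers only when they form a heading ("## Title").
            let rest = line.trim_start_matches('#');
            let text = if rest.len() != line.len() && rest.starts_with(' ') {
                rest.trim_start()
            } else {
                line
            };
            text.replace("**", "").replace('`', "")
        })
        .collect::<Vec<_>>()
        .join("\n")
}

/// Builds a chunk that retains the raw Markdown alongside the stripped text.
pub fn make_chunk(md: &str) -> Chunk {
    Chunk {
        raw: md.to_string(),
        embed_input: strip_markdown(md),
    }
}
```

Keeping both fields means the splitter can still see headings when deciding where to cut, while the embedding model receives text without formatting noise.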

akshayballal95 · 2025-02-13T21:32:43Z

Yes. I think we should retain the Markdown formatting. The reason I had it like this was that a library I was using earlier to parse Markdown panicked when given Markdown without frontmatter.

But now that I think about it, a good strategy would be to just return all the text as Markdown and use MarkdownSplitter. This will take care of handling the Markdown headers.
