fix: fix hybrid chunker token constraint #131

Merged: 1 commit, Jan 17, 2025

vagenas (Collaborator) commented Jan 17, 2025

Resolves DS4SD/docling#723

  • Root cause: delimiters, both (1) between headings/captions and text and (2) within headings/captions, were not properly accounted for when calculating the number of available tokens. The mismatch showed up with different tokenizers (e.g. granite), and even with the default tokenizer (all-MiniLM-L6-v2) when used with more complex delimiters such as ####.
  • Refactored the chunker's internals to reuse common API elements (.serialize(), .delim) instead of needlessly replicating the serialization logic.
  • Added a test covering the addressed case.
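
To illustrate the root cause described above, here is a minimal sketch of how ignoring delimiters inflates the token budget left for the body text. It is not the chunker's actual code: `count_tokens` is a toy stand-in for a real tokenizer, and `serialize`, `available_text_tokens_naive`, and `available_text_tokens` are hypothetical helpers that do not mirror the signatures of the library's `.serialize()` or `.delim`.

```python
import re
from typing import Sequence


def count_tokens(text: str) -> int:
    """Toy tokenizer (assumption): each word and each symbol is one token."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def serialize(headings: Sequence[str], text: str, delim: str) -> str:
    """Join headings and body text with one delimiter (hypothetical helper)."""
    return delim.join([*headings, text])


def available_text_tokens_naive(headings: Sequence[str], max_tokens: int) -> int:
    """Buggy variant: counts heading tokens only and ignores the delimiters."""
    return max_tokens - sum(count_tokens(h) for h in headings)


def available_text_tokens(headings: Sequence[str], delim: str, max_tokens: int) -> int:
    """Measures the prefix exactly as it is serialized, delimiters included."""
    prefix = serialize(headings, "", delim)
    return max_tokens - count_tokens(prefix)


headings = ["Annual Report", "Results"]
delim = " #### "  # a "complex" delimiter: four tokens under the toy tokenizer
max_tokens = 16

naive_budget = available_text_tokens_naive(headings, max_tokens)
real_budget = available_text_tokens(headings, delim, max_tokens)

# Filling the naive budget overshoots the limit once the chunk is serialized.
too_long = serialize(headings, " ".join(["word"] * naive_budget), delim)
fits = serialize(headings, " ".join(["word"] * real_budget), delim)

print(naive_budget, real_budget)                      # 13 5
print(count_tokens(too_long), count_tokens(fits))     # 24 16
```

Measuring the serialized prefix, rather than summing the parts, keeps the budget consistent with what is finally emitted. With a real subword tokenizer the counts are not strictly additive, so a final check on the fully serialized chunk is still a reasonable safeguard.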


mergify bot commented Jan 17, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

dolfim-ibm (Contributor) commented

python3.13 should be fixed with f79791f

vagenas force-pushed the fix-hybrid-chunking-token-constraint branch from f763699 to be66e5e on January 17, 2025 16:15
vagenas requested review from cau-git and dolfim-ibm on January 17, 2025 16:16
vagenas marked this pull request as ready for review on January 17, 2025 16:16
vagenas merged commit b741eea into main on Jan 17, 2025
8 checks passed
vagenas deleted the fix-hybrid-chunking-token-constraint branch on January 17, 2025 16:22
Development

Successfully merging this pull request may close these issues.

Slight token number mismatches from hybrid chunker