Token confidence #7

Open
wants to merge 2 commits into
base: main


willangley
Owner

Switch to per-token confidence when writing out PDFs. This will (eventually!) let us keep the readable part of a line of text when only part of the line has been damaged, instead of losing the whole line.
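
To illustrate the difference, here is a minimal sketch rather than the code in this PR. The `Token` class, both function names, and the idea that a line was previously judged by its mean confidence are all assumptions made up for the example; only the `min_confidence` threshold comes from this conversation.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    confidence: float  # hypothetical OCR score in the range 0.0 to 1.0


def keep_line_by_mean(tokens, min_confidence):
    """Per-line filtering (assumed baseline): the whole line is kept or dropped."""
    if not tokens:
        return []
    mean = sum(t.confidence for t in tokens) / len(tokens)
    return list(tokens) if mean >= min_confidence else []


def keep_tokens(tokens, min_confidence):
    """Per-token filtering: only tokens below the threshold are dropped."""
    return [t for t in tokens if t.confidence >= min_confidence]


# One damaged token in an otherwise clean line (illustrative data).
line = [Token("Invoice", 0.98), Token("t0ta1", 0.35), Token("due", 0.97)]

print([t.text for t in keep_line_by_mean(line, 0.9)])
print([t.text for t in keep_tokens(line, 0.9)])
```

With the sample line, the per-line version discards everything because one bad token drags the mean below the threshold, while the per-token version keeps the two clean words.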

@willangley willangley self-assigned this Nov 22, 2021
@willangley
Owner Author

This isn't ready yet.

  • --min_confidence=0.9 discards too much good text. We should tune it again.
  • Text selection in Preview doesn't work reliably; it's easy for selection to pass through a gap between tokens and start selecting a background image instead.
    • this can happen naturally from white space between tokens in the original text...
    • or from gaps left behind where low-confidence tokens were discarded. We may want to fill these gaps with other text instead.
  • We're not writing spaces between tokens in the PDF, and should be. This was masked by viewers auto-inserting spaces based on the gap between tokens, but that's not reliable, and when it fails theresultingtextishardtoread [1].

Footnotes

  1. The resulting text is hard to read.
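
The last two points could be sketched roughly as follows. This is an illustration under stated assumptions, not what the PR does: `build_text_layer` is a hypothetical helper, tokens are assumed to arrive as `(text, confidence)` pairs in reading order, and filling a low-confidence token with `?` characters is just one possible choice of "other text".

```python
def build_text_layer(tokens, min_confidence, placeholder="?"):
    """Join tokens with explicit spaces instead of relying on the viewer.

    tokens: list of (text, confidence) pairs in reading order (assumed shape).
    Low-confidence tokens are replaced by a same-length run of `placeholder`
    so the text layer has no hole where selection could fall through.
    """
    words = []
    for text, confidence in tokens:
        if confidence >= min_confidence:
            words.append(text)
        else:
            # Assumption: keep the token's width with filler characters.
            words.append(placeholder * len(text))
    # The space is written into the text itself, not inferred from a gap.
    return " ".join(words)


sample = [("the", 0.99), ("resu1ting", 0.40), ("text", 0.95)]
print(build_text_layer(sample, min_confidence=0.9))
```

Whether filler characters are acceptable in copied text is a separate question; the point is only that the space and the gap are both written explicitly.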
