feat(classify):add OCR mode switching based on char overlap detection #1337

Closed
wants to merge 1 commit into from

pangguosheng1106

Motivation

(1) Observed

Some PDFs contain bold fonts that are produced by rendering the same character several times. In some of them the characters visibly overlap; in others the overlap is not obvious. Parsing such content in TXT mode produces partially duplicated text, while OCR mode works correctly. I think it would be smarter for the MinerU framework to recognize and handle such cases automatically.

Original PDF
Original image of the case

TXT Mode Recognition Result
TXT_mode

OCR Mode Recognition Result
OCR_mode

(2) Error Location

The error is caused by overlapping characters in the spans read by PyMuPDF. Note that the IOU between the overlapping bounding boxes (bbox) is not necessarily high.

(3) Example

[
  {
    "origin": [135.0, 150.39999389648438],
    "bbox": [135.0, 133.22799682617188, 153.0, 154.95399475097656],
    "c": ""
  },
  {
    "origin": [128.92857360839844, 150.47747802734375],
    "bbox": [128.92857360839844, 134.64581298828125, 146.5, 153.05772399902344],
    "c": ""
  },
  {
    "origin": [128.92857360839844, 150.47747802734375],
    "bbox": [128.92857360839844, 134.64581298828125, 146.5, 153.05772399902344],
    "c": ""
  },
  {
    "origin": [153.0, 150.39999389648438],
    "bbox": [153.0, 133.22799682617188, 171.0, 154.95399475097656],
    "c": ""
  },
  {
    "origin": [146.5, 150.47747802734375],
    "bbox": [146.5, 134.64581298828125, 164.07142639160156, 153.05772399902344],
    "c": ""
  },
  {
    "origin": [146.5, 150.47747802734375],
    "bbox": [146.5, 134.64581298828125, 164.07142639160156, 153.05772399902344],
    "c": ""
  }
]

(4) Attempted Methods

  • Filtering Characters with High IOU
    Directly filter out characters from the span where the IOU exceeds a certain threshold.
  • Identifying Overlapping Characters within a Span
    Check whether any characters within a span overlap (IOU above a certain threshold) and, if so, set an overlap flag fixme=true on the span. The flag is then used to trigger OCR correction.

Neither of the above methods addresses the issue reliably. If the threshold is set too high, duplicated content may remain; if it is set too low, normal content may be affected. In the example above, the character '辐' appears only once in the original text but three times in the span, and its IOU list is [0.4210700432518626, 1.0]. So I believe that choosing an IOU threshold is challenging.
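
The IOU values quoted above can be reproduced from the bbox values in the example. The following is a minimal sketch, assuming each bbox is an [x0, y0, x1, y1] list and that the IOU list pairs consecutive entries; the helper name bbox_iou is hypothetical and is not taken from this PR.

```python
def bbox_iou(a, b):
    """Intersection over union of two [x0, y0, x1, y1] boxes."""
    # Corners of the intersection rectangle (may be empty).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# bbox values of the first three entries in the example above.
first = [135.0, 133.22799682617188, 153.0, 154.95399475097656]
second = [128.92857360839844, 134.64581298828125, 146.5, 153.05772399902344]
third = [128.92857360839844, 134.64581298828125, 146.5, 153.05772399902344]

# Assumption: the list compares consecutive entries.
iou_list = [bbox_iou(first, second), bbox_iou(second, third)]
print(iou_list)  # approximately [0.42107, 1.0]
```

With these values, any threshold above roughly 0.42 would keep the first and second entries as separate characters, even though they are renderings of the same character.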

Modification

This commit primarily modifies the pdf_meta_scan class. It adds the function get_char_overlap_per_page, which collects per-page statistics on character overlaps. When these statistics meet certain conditions, OCR mode is triggered.
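
As an illustration of what a page-level overlap statistic could look like, here is a rough sketch. It is not the implementation of get_char_overlap_per_page from this commit: the function names char_overlap_ratio and should_use_ocr, the char dict layout ("c" and "bbox" keys, as in the example above), and both threshold values are assumptions made for the example.

```python
from itertools import combinations


def bbox_iou(a, b):
    """Intersection over union of two [x0, y0, x1, y1] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def char_overlap_ratio(chars, iou_threshold=0.3):
    """Fraction of chars on one page that overlap another char with the same text.

    Hypothetical helper; iou_threshold is an illustrative value.
    Pairwise comparison is O(n^2) in the number of chars on the page.
    """
    overlapped = set()
    for (i, a), (j, b) in combinations(enumerate(chars), 2):
        # Requiring identical text limits false positives from tight kerning.
        if a["c"] == b["c"] and bbox_iou(a["bbox"], b["bbox"]) >= iou_threshold:
            overlapped.update((i, j))
    return len(overlapped) / len(chars) if chars else 0.0


def should_use_ocr(pages, ratio_threshold=0.2):
    """Switch to OCR if any page has a high share of overlapping chars.

    Hypothetical decision rule; ratio_threshold is an illustrative value.
    """
    return any(char_overlap_ratio(chars) >= ratio_threshold for chars in pages)


# A page with the triple-rendered character from the example above
# (assumption: all three entries carry the character '辐').
dup_page = [
    {"c": "辐", "bbox": [135.0, 133.22799682617188, 153.0, 154.95399475097656]},
    {"c": "辐", "bbox": [128.92857360839844, 134.64581298828125, 146.5, 153.05772399902344]},
    {"c": "辐", "bbox": [128.92857360839844, 134.64581298828125, 146.5, 153.05772399902344]},
]
# A page with two distinct, non-overlapping characters.
clean_page = [
    {"c": "a", "bbox": [0.0, 0.0, 10.0, 10.0]},
    {"c": "b", "bbox": [10.0, 0.0, 20.0, 10.0]},
]

print(char_overlap_ratio(dup_page), char_overlap_ratio(clean_page))
print(should_use_ocr([clean_page]), should_use_ocr([clean_page, dup_page]))
```

A page-level ratio like this decides the mode for the whole document, so a single affected page can switch every page to OCR.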

Use cases

char_overlaps_case1.pdf

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

Contributor


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


pangguosheng seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@myhloli
Collaborator

myhloli commented Dec 20, 2024

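I don't think this needs to be done at the classification stage. We now support span-level dynamic OCR, so we can run OCR only on the spans where overlapping chars are detected. That handling is more fine-grained and has less impact on performance.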

#1338

@myhloli myhloli closed this Dec 20, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Dec 20, 2024
@myhloli myhloli added the bug (Something isn't working) and good first issue (Good for newcomers) labels Dec 20, 2024