feat(classify)：add OCR mode switching based on char overlap detection #1337

pangguosheng1106 · 2024-12-20T08:24:02Z

Motivation

(1) Observed

Some PDFs render bold fonts by drawing the same glyph several times. In some of these files the characters visibly overlap; in others no overlap is obvious. Parsing such files in TXT mode produces partially duplicated text, while OCR mode parses them correctly. I believe MinerU would be more intelligent if it detected and handled such cases automatically.

Original PDF

TXT Mode Recognition Result

OCR Mode Recognition Result

(2) Error Location

I have traced the error to overlapping characters in the spans read by PyMuPDF. The IOU between the overlapping bounding boxes (bbox) is not necessarily high.

(3) Example

[
  {
    "origin": [135.0, 150.39999389648438],
    "bbox": [135.0, 133.22799682617188, 153.0, 154.95399475097656],
    "c": "辐"
  },
  {
    "origin": [128.92857360839844, 150.47747802734375],
    "bbox": [128.92857360839844, 134.64581298828125, 146.5, 153.05772399902344],
    "c": "辐"
  },
  {
    "origin": [128.92857360839844, 150.47747802734375],
    "bbox": [128.92857360839844, 134.64581298828125, 146.5, 153.05772399902344],
    "c": "辐"
  },
  {
    "origin": [153.0, 150.39999389648438],
    "bbox": [153.0, 133.22799682617188, 171.0, 154.95399475097656],
    "c": "射"
  },
  {
    "origin": [146.5, 150.47747802734375],
    "bbox": [146.5, 134.64581298828125, 164.07142639160156, 153.05772399902344],
    "c": "射"
  },
  {
    "origin": [146.5, 150.47747802734375],
    "bbox": [146.5, 134.64581298828125, 164.07142639160156, 153.05772399902344],
    "c": "射"
  }
]

(4) Attempted Methods

Filtering Characters with High IOU
Directly filter out of the span any character whose IOU with another character exceeds a certain threshold.
Identifying Overlapping Characters within a Span
Check whether any characters within a span overlap (IOU exceeding a certain threshold) and, if so, set an overlap flag fixme=true on the span. This flag is then used to trigger OCR correction.

Neither of the above methods effectively addresses this issue. If the threshold is set too high, duplicated content may still remain; if it is set too low, normal content may be affected. In the example above, the character '辐' appears only once in the original text but three times in the span, and its IOU list is [0.4210700432518626, 1.0]. So I believe that choosing an IOU threshold is challenging.
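To make the threshold problem concrete, here is a minimal sketch that recomputes the IOU values for the three '辐' boxes from the example. The helper name `bbox_iou` is hypothetical, and pairing each character with the next one in the span is an assumption about how the IOU list above was produced.

```python
def bbox_iou(a, b):
    """IOU of two boxes given as (x0, y0, x1, y1)."""
    # Width and height of the intersection rectangle (zero if disjoint)
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# The three '辐' bounding boxes from the example span, in span order
fu_boxes = [
    (135.0, 133.22799682617188, 153.0, 154.95399475097656),
    (128.92857360839844, 134.64581298828125, 146.5, 153.05772399902344),
    (128.92857360839844, 134.64581298828125, 146.5, 153.05772399902344),
]

# Assumption: IOU is taken between consecutive characters in the span
ious = [bbox_iou(a, b) for a, b in zip(fu_boxes, fu_boxes[1:])]
print(ious)  # roughly 0.42 for the first pair, 1.0 for the identical pair
```

A threshold above about 0.42 removes only one of the two extra copies, while a threshold low enough to catch the first pair risks matching legitimately adjacent characters.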

Modification

This commit primarily modifies the pdf_meta_scan class. It adds the function get_char_overlap_per_page, which collects statistics on character overlaps for each page. When certain conditions are met, OCR mode is triggered.
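The idea of page-level overlap statistics can be sketched as follows. This is not the code from the commit: `char_overlap_ratio`, `should_use_ocr`, and the `ratio_threshold` value are hypothetical, and the rule "a character counts as overlapped when its box intersects the box of another character with the same glyph" is an assumption chosen to avoid picking an IOU threshold. Characters are plain dicts with the `c` and `bbox` keys shown in the example.

```python
def boxes_intersect(a, b):
    """True if two (x0, y0, x1, y1) boxes share a region of non-zero area."""
    return (min(a[2], b[2]) > max(a[0], b[0])
            and min(a[3], b[3]) > max(a[1], b[1]))


def char_overlap_ratio(chars):
    """Fraction of chars whose box intersects another char with the same glyph."""
    if not chars:
        return 0.0
    overlapped = 0
    for i, ch in enumerate(chars):
        for j, other in enumerate(chars):
            if (i != j and ch["c"] == other["c"]
                    and boxes_intersect(ch["bbox"], other["bbox"])):
                overlapped += 1
                break
    return overlapped / len(chars)


def should_use_ocr(pages, ratio_threshold=0.1):
    """Hypothetical rule: switch to OCR if any page exceeds the threshold."""
    return any(char_overlap_ratio(chars) > ratio_threshold for chars in pages)


# Characters from the example span (origin omitted, it is not needed here)
sample_page = [
    {"c": "辐", "bbox": [135.0, 133.22799682617188, 153.0, 154.95399475097656]},
    {"c": "辐", "bbox": [128.92857360839844, 134.64581298828125, 146.5, 153.05772399902344]},
    {"c": "辐", "bbox": [128.92857360839844, 134.64581298828125, 146.5, 153.05772399902344]},
    {"c": "射", "bbox": [153.0, 133.22799682617188, 171.0, 154.95399475097656]},
    {"c": "射", "bbox": [146.5, 134.64581298828125, 164.07142639160156, 153.05772399902344]},
    {"c": "射", "bbox": [146.5, 134.64581298828125, 164.07142639160156, 153.05772399902344]},
]
# A page whose repeated characters sit side by side without overlapping
normal_page = [
    {"c": "l", "bbox": [0.0, 0.0, 10.0, 10.0]},
    {"c": "l", "bbox": [10.0, 0.0, 20.0, 10.0]},
]

print(char_overlap_ratio(sample_page), char_overlap_ratio(normal_page))
print(should_use_ocr([normal_page, sample_page]))
```

The pairwise comparison is quadratic in the number of characters, so a real implementation would likely restrict it to characters within the same span or line. Tightly kerned text may also produce small genuine intersections between identical neighbouring glyphs, which this simple rule would count as overlaps.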

Use cases

char_overlaps_case1.pdf

Checklist

Before PR:

Pre-commit or other linting tools are used to fix the potential lint issues.
Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
CLA has been signed and all committers have signed the CLA in this PR.

github-actions · 2024-12-20T08:24:17Z

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.

I have read the CLA Document and I hereby sign the CLA

pangguosheng seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

myhloli · 2024-12-20T09:38:44Z

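I don't think this needs to be done at the classification stage. We now support span-level dynamic OCR, so we can run OCR only on the spans where char overlap is detected. That approach is finer-grained and has less impact on performance.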

#1338

feat(classify)：add OCR mode switching based on char overlap detection

c3da5a0

myhloli closed this Dec 20, 2024

github-actions bot locked and limited conversation to collaborators Dec 20, 2024

myhloli added the bug and good first issue labels Dec 20, 2024
