feat(classify):add OCR mode switching based on char overlap detection #1337
+89
−2
Motivation
(1) Observed
Some PDFs contain bold fonts. In some of them the bold effect is produced by rendering the same characters multiple times, which yields overlapping characters; in others no obvious overlap is visible. Parsing such content in TXT mode produces partially duplicated text, whereas OCR mode parses it correctly. I believe it would be better for the MinerU framework to detect and handle such cases automatically.
Original PDF
![Original case image](https://private-user-images.githubusercontent.com/29985149/397665451-90f3bfea-f57b-48c9-8181-f58f9f66006c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk5NzAwNzEsIm5iZiI6MTczOTk2OTc3MSwicGF0aCI6Ii8yOTk4NTE0OS8zOTc2NjU0NTEtOTBmM2JmZWEtZjU3Yi00OGM5LTgxODEtZjU4ZjlmNjYwMDZjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE5VDEyNTYxMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWNmZTRjZWYzMDYyYWI0NWEyOWM5NmM3MWFlN2UzNjcyNGEzZTI1ZGFlN2I4MDNjZWRiMTIyNDJkMDU2MWMzNTUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.hI7UyOuedYM1Vyq3O0OlKdWv2rcGFTNIm7uY_tTcPAA)
TXT Mode Recognition Result
![TXT_mode](https://private-user-images.githubusercontent.com/29985149/397665492-e02a800a-2ad9-49b9-87e0-94c728349f32.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk5NzAwNzEsIm5iZiI6MTczOTk2OTc3MSwicGF0aCI6Ii8yOTk4NTE0OS8zOTc2NjU0OTItZTAyYTgwMGEtMmFkOS00OWI5LTg3ZTAtOTRjNzI4MzQ5ZjMyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE5VDEyNTYxMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTI2NTQ2ZjQ2NTczMjRhYzlhNzI3MzUyOWNiZjFjNTg4NDJjMTM5MjljYjU5ZGZhMGQ2NWJiYzA4YzA1OGNmOWQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.5Joy70IHeJvvNwX9FP279LNGdjU2P8NR0GqESJrD8HI)
OCR Mode Recognition Result
![OCR_mode](https://private-user-images.githubusercontent.com/29985149/397665530-e62888d9-f4f3-4e6a-a790-37325de6159c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk5NzAwNzEsIm5iZiI6MTczOTk2OTc3MSwicGF0aCI6Ii8yOTk4NTE0OS8zOTc2NjU1MzAtZTYyODg4ZDktZjRmMy00ZTZhLWE3OTAtMzczMjVkZTYxNTljLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE5VDEyNTYxMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTYwZTc0OGNmNmY0MjYzZGFhMDQ4MGVlZjU4Mzg4MGE0NDNkOGRjZWE4MDc3MTIwNGU2MmEyYWQ5ZDI3MTUxMWImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.S17n-iOtP5VA-8MWL7LwNCf4HXx0f_LfRCNe42lQNTQ)
(2) Error Location
The error is caused by overlapping characters within the spans read by PyMuPDF. The IoU between the bounding boxes (bbox) of the overlapping characters is not necessarily high.
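For reference, the IoU (intersection over union) of two character bounding boxes can be computed as in the minimal sketch below. It assumes boxes in the `(x0, y0, x1, y1)` form that PyMuPDF reports; the function name `bbox_iou` is illustrative and is not taken from this PR.

```python
from typing import Sequence

BBox = Sequence[float]  # (x0, y0, x1, y1), with x0 <= x1 and y0 <= y1


def bbox_iou(a: BBox, b: BBox) -> float:
    """Return intersection-over-union of two axis-aligned boxes (hypothetical helper)."""
    # Width and height of the intersection rectangle (non-positive if disjoint).
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A character drawn twice with a small horizontal offset overlaps its copy only partially, so its IoU can be well below 1.0 even though the text is clearly duplicated.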
(3) Example
(4) Attempted Methods
Directly filter out characters within a span whose IoU exceeds a certain threshold.
Check whether any characters within a span overlap (IoU exceeding a certain threshold) and, if so, set an overlap flag fixme=true. This flag is then used to trigger OCR correction.
Neither method effectively addresses the issue. If the threshold is set too high, duplicated content may remain; if it is set too low, normal content may be affected. In the example above, the character '辐' appears only once in the original text but three times in the span, and its IoU list is `[0.4210700432518626, 1.0]`. I therefore believe that choosing a suitable IoU threshold is difficult.
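The threshold sensitivity can be illustrated with the IoU list quoted above. This is only a sketch: it assumes each value is the IoU of one extra copy of '辐' against an earlier copy, and that a copy is filtered out when its IoU exceeds the threshold. The function name and the thresholds 0.3 and 0.5 are hypothetical.

```python
from typing import List


def copies_left_after_filter(ious: List[float], threshold: float) -> int:
    """Count copies of a character that survive IoU filtering (illustrative only)."""
    # One genuine character plus len(ious) extra copies; a copy is removed
    # only when its IoU against an earlier copy exceeds the threshold.
    removed = sum(1 for value in ious if value > threshold)
    return 1 + len(ious) - removed


ious = [0.4210700432518626, 1.0]  # IoU list reported for '辐'
for threshold in (0.3, 0.5):
    print(threshold, copies_left_after_filter(ious, threshold))
```

Under these assumptions, a threshold of 0.5 leaves two copies of the character, while 0.3 leaves one; a threshold that low, however, is more likely to also remove legitimately adjacent characters.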
Modification
This commit primarily modifies the `pdf_meta_scan` class. It adds the function `get_char_overlap_per_page` to obtain statistics on character overlaps. When certain conditions are met, the OCR mode can be triggered.
Use cases
char_overlaps_case1.pdf
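The general idea of collecting per-page overlap statistics and using them to choose OCR mode could look like the sketch below. It is not the implementation in this PR: the helper names, the pairwise same-character comparison, and both thresholds are assumptions made for illustration. The input is assumed to have the shape returned by PyMuPDF's `page.get_text("rawdict")`, where each span carries a `chars` list whose entries have `c` and `bbox` keys.

```python
from typing import Dict, Iterable, Sequence

BBox = Sequence[float]  # (x0, y0, x1, y1)


def bbox_iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def char_overlap_ratio(page_dict: Dict, iou_threshold: float = 0.3) -> float:
    """Fraction of characters on a page that overlap an earlier identical
    character in the same span (hypothetical statistic and threshold)."""
    total = 0
    overlapping = 0
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                chars = span.get("chars", [])
                total += len(chars)
                for i, cur in enumerate(chars):
                    # Compare against every earlier character in the span.
                    if any(
                        prev["c"] == cur["c"]
                        and bbox_iou(prev["bbox"], cur["bbox"]) > iou_threshold
                        for prev in chars[:i]
                    ):
                        overlapping += 1
    return overlapping / total if total else 0.0


def should_switch_to_ocr(page_ratios: Iterable[float], ratio_threshold: float = 0.05) -> bool:
    """Suggest OCR mode if any page exceeds the (assumed) overlap ratio."""
    return any(ratio > ratio_threshold for ratio in page_ratios)
```

With PyMuPDF installed, the per-page input would come from something like `page.get_text("rawdict")` for each page of the opened document. Basing the decision on a page-level ratio rather than on individual characters avoids having to pick a single IoU cutoff that cleanly separates duplicates from normal text.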
Checklist
Before PR:
After PR: