1000 >= right >= left >= 0 and 1000 >= bottom >= top >= 0 AssertionError: Invalid box. right: 901, left: 583, bottom: 1000, top: 1002 #1879

Open
ywh-my opened this issue Mar 9, 2025 · 0 comments
Labels
bug Something isn't working

ywh-my commented Mar 9, 2025

Description of the bug

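I ran an internal company file (similar to a PPT-style layout, but with a lot of very dense text), so I am unable to send the file to you. The run ended with this error: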
[03/09 12:19:59 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /...//AImodels/opendatalab/PDF-Ext
[03/09 12:19:59 fvcore.common.checkpoint]: [Checkpointer] Loading from /...//AImodels/opendatalab/PDF-Extract-Kit-1___0/mode
2025-03-09 12:20:00.168 | INFO | magic_pdf.model.pdf_extract_kit:init:174 - DocAnalysis init done!
2025-03-09 12:20:00.168 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:130 - model init cost: 4.70296573638916
2025-03-09 12:20:00.168 | INFO | main:setup:172 - mineru model initialization complete
2025-03-09 12:20:00.168 | INFO | main:setup:178 - paddleocr model loading started
2025-03-09 12:20:00.411685506 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution provider
2025-03-09 12:20:00.411713547 [W:onnxruntime:, session_state.cc:1170 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show n
2025-03-09 12:20:00.486 | INFO | main:setup:182 - paddleocr model initialization complete
2025-03-09 12:20:00.486 | INFO | main:setup:185 - table structure parsing model loading started
2025-03-09 12:20:00.557 | INFO | main:setup:188 - table structure parsing model initialization complete
2025-03-09 12:20:00.658 | INFO | main:task101_pdf_parse:224 - do parse , theoutput_dir:/...//AMinerU_SaveTemp/164620
2025-03-09 12:20:00.658 | INFO | main:task101_pdf_parse:225 - current do parse process ID: 3271185
2025-03-09 12:20:00.658 | INFO | main:task101_pdf_parse:226 - current input PDF file size: 9.616772999999998 MB
2025-03-09 12:20:00.658 | INFO | main:task101_pdf_parse:227 - starting PDF parsing
2025-03-09 12:20:00.780 | INFO | magic_pdf.data.dataset:init:156 - lang: None
2025-03-09 12:20:00.781 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:186 - gpu_memory: 24 GB, batch_ratio: 16
2025-03-09 12:20:42.444 | INFO | magic_pdf.model.batch_analyze:call:74 - layout time: 34.65, image num: 89
2025-03-09 12:23:55.439 | INFO | magic_pdf.model.batch_analyze:call:195 - det time: 192.16, image num: 1099
2025-03-09 12:23:55.885 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:241 - gc time: 0.44
2025-03-09 12:23:55.885 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:245 - doc analyze time: 235.1, speed: 0.38 pages/second
2025-03-09 12:23:57.995 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:938 - pages to parse: 89
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 191, bottom: 206, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 206, bottom: 221, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 221, bottom: 236, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 236, bottom: 251, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 251, bottom: 266, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 266, bottom: 281, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 281, bottom: 296, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 296, bottom: 311, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 311, bottom: 326, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 326, bottom: 341, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 341, bottom: 356, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 356, bottom: 371, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 371, bottom: 386, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 386, bottom: 401, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 401, bottom: 416, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 416, bottom: 431, page_w
2025-03-09 12:24:01.371 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 431, bottom: 446, page_w
2025-03-09 12:24:01.372 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 446, bottom: 461, page_w
2025-03-09 12:24:01.372 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 552, right: 752, top: 461, bottom: 476, page_w
2025-03-09 12:24:02.420 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:547 - bottom > page_h, left: 32, right: 238, top: 477, bottom: 485, page_w
2025-03-09 12:24:02.933 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:547 - bottom > page_h, left: 3, right: 340, top: 469.0, bottom: 485.0, pag
2025-03-09 12:24:03.565 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 2, right: 752, top: 107, bottom: 231.666666666
2025-03-09 12:24:03.565 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 2, right: 752, top: 231.66666666666669, bottom
2025-03-09 12:24:03.565 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 2, right: 752, top: 356.33333333333337, bottom
2025-03-09 12:24:04.086 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 0, right: 752, top: 92, bottom: 219.3333333333
2025-03-09 12:24:04.086 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 0, right: 752, top: 219.33333333333331, bottom
2025-03-09 12:24:04.087 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 0, right: 752, top: 346.66666666666663, bottom
2025-03-09 12:24:06.321 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:547 - bottom > page_h, left: 9, right: 369, top: 477.5, bottom: 494.0, pag
2025-03-09 12:24:07.407 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 7, right: 752, top: 350, bottom: 389.333333333
2025-03-09 12:24:07.407 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 7, right: 752, top: 389.3333333333333, bottom:
2025-03-09 12:24:07.407 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:537 - right > page_w, left: 7, right: 752, top: 428.66666666666663, bottom
2025-03-09 12:24:08.991 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:547 - bottom > page_h, left: 438, right: 677, top: 469, bottom: 483, page_w: 751.2000122070312, page_h: 482.0400085449219
2025-03-09 12:24:08.992 | WARNING | magic_pdf.pdf_parse_union_core_v2:sort_lines_by_model:547 - bottom > page_h, left: 438, right: 677, top: 483, bottom: 497, page_w: 751.2000122070312, page_h: 482.0400085449219
2025-03-09 12:24:08.992 | INFO | main:task101_pdf_parse:371 - high-precision PDF parsing failed: Invalid box. right: 901, left: 583, bottom: 1000, top: 1002
Traceback (most recent call last):
  File "/...//ai_servers/server01_mineruBase_Litserve.py", line 229, in task101_pdf_parse
    out = do_parse(
  File "/home/kemove/miniconda3/envs/minerU120/lib/python3.10/site-packages/magic_pdf/tools/common.py", line 151, in do_parse
    pipe_result = infer_result.pipe_txt_mode(
  File "/home/kemove/miniconda3/envs/minerU120/lib/python3.10/site-packages/magic_pdf/operators/models.py", line 102, in pipe_txt_mode
    res = self.apply(
  File "/home/kemove/miniconda3/envs/minerU120/lib/python3.10/site-packages/magic_pdf/operators/models.py", line 70, in apply
    return proc(copy.deepcopy(self._infer_res), *args, **kwargs)
  File "/home/kemove/miniconda3/envs/minerU120/lib/python3.10/site-packages/magic_pdf/operators/models.py", line 95, in proc
    res = pdf_parse_union(*args, **kwargs)
  File "/home/kemove/miniconda3/envs/minerU120/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core_v2.py", line 951, in pdf_parse_union
    page_info = parse_page_core(
  File "/home/kemove/miniconda3/envs/minerU120/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core_v2.py", line 869, in parse_page_core
    sorted_bboxes = sort_lines_by_model(fix_blocks, page_w, page_h, line_height)
  File "/home/kemove/miniconda3/envs/minerU120/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core_v2.py", line 557, in sort_lines_by_model
    1000 >= right >= left >= 0 and 1000 >= bottom >= top >= 0
AssertionError: Invalid box. right: 901, left: 583, bottom: 1000, top: 1002
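
The numbers in the assertion message appear to match the last warning in the log (left: 438, right: 677, top: 483, bottom: 497 on a 751.2 x 482.04 page). The sketch below is a hypothetical reconstruction, not the MinerU source: it assumes that `right` and `bottom` are clamped to the page size and that all four edges are then rescaled to a 0-1000 grid. The helper name `scale_box` is made up for illustration.

```python
def scale_box(left, top, right, bottom, page_w, page_h):
    """Hypothetical reconstruction: clamp right/bottom, then map to a 0-1000 grid."""
    # Assumption: only the right and bottom edges are clamped to the page.
    right = min(right, page_w)
    bottom = min(bottom, page_h)
    x_scale = 1000.0 / page_w
    y_scale = 1000.0 / page_h
    return (
        round(left * x_scale),
        round(top * y_scale),
        round(right * x_scale),
        round(bottom * y_scale),
    )


# Values taken from the last "bottom > page_h" warning in the log.
page_w, page_h = 751.2000122070312, 482.0400085449219
left, top, right, bottom = scale_box(438, 483, 677, 497, page_w, page_h)
print(left, top, right, bottom)  # 583 1002 901 1000
```

Under these assumptions, `top` (483) already lies below the page bottom (482.04), so it scales to 1002 while the clamped `bottom` scales to 1000, which is exactly the `top > bottom` state the assertion rejects.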

You may need to check how the detection boxes relate to the PDF page size.
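
One possible check, as a minimal sketch: clamp all four edges of a box into the page before any rescaling, so that a box starting outside the page cannot end up with `top > bottom`. The function name `clamp_box` is hypothetical, and where such a check would belong in magic_pdf is an assumption I have not verified.

```python
def clamp_box(left, top, right, bottom, page_w, page_h):
    """Clamp every edge into the page and keep left <= right, top <= bottom."""
    left = min(max(left, 0), page_w)
    right = min(max(right, left), page_w)
    top = min(max(top, 0), page_h)
    bottom = min(max(bottom, top), page_h)
    return left, top, right, bottom


# The box from the last "bottom > page_h" warning in the log.
page_w, page_h = 751.2000122070312, 482.0400085449219
box = clamp_box(438, 483, 677, 497, page_w, page_h)
print(box)
```

For a box that lies entirely below the page, as here, this yields a zero-height box on the bottom edge. Dropping such boxes instead of keeping them may be the better choice, depending on how later steps use them.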

How to reproduce the bug

Sorry, I cannot provide the PDF file for reproduction. mineru = 1.2.0

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

No response

Device mode

cuda

ywh-my added the bug label on Mar 9, 2025