pymupdf4llm for multi-page table #3954

bjmvercelli · 2024-10-16T18:44:06Z

Hello, I have been trying to find a PDF parser that handles tables that start on one page and end on another, without the column headers being repeated.

Example:

Does pymupdf4llm handle this scenario?

JorjMcKie · 2024-10-17T14:56:04Z

Maintainer

No, we don't. This request goes beyond syntactical extraction logic: we currently produce Markdown text page by page.
There is no attempt to detect content that crosses page boundaries. This applies not only to tables but also to, e.g., text paragraphs.
Detecting that a table on one page actually continues the last table on the previous page would turn the existing (page-wise) logic on its head. In addition, if a table has no header row, how would we even establish that it continues an earlier table?

Number of columns? Not a safe indicator!
Equal column widths as well? Still not safe. What is more, a table with the same column count but different column widths may still be a continuation.
That leaves checking that each column holds the same data types as in the previous table ... sorry: this is simply beyond any reasonable scope.

If you know that your tables are continuations, you can still join them yourself: export each one to a pandas DataFrame and then concatenate the DataFrames with pandas. There is an example script in the utilities repo.
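A minimal sketch of that manual join might look like the following. It is not the script from the utilities repo, and it rests on several assumptions: each page holds at most one relevant table, only the first page's table carries a header row, and every continuation has the same column count. The helper names `join_continued_tables` and `extract_first_tables` are hypothetical. The rows are read with PyMuPDF's `Table.extract()` and the DataFrames are built by hand, so that the first row of a continuation is not mistaken for a header.

```python
import pandas as pd


def join_continued_tables(tables, has_header=True):
    """Join row lists from consecutive pages into one DataFrame.

    `tables` is a list of tables in page order, each a non-empty list of
    rows (lists of cell values). Assumption: only the first table carries
    a header row; all later tables are pure continuations.
    """
    if not tables:
        return pd.DataFrame()
    first = tables[0]
    if has_header:
        columns, first_rows = first[0], first[1:]
    else:
        # No header anywhere: fall back to generic column names.
        columns = [f"col{i}" for i in range(len(first[0]))]
        first_rows = first
    frames = [pd.DataFrame(first_rows, columns=columns)]
    for rows in tables[1:]:
        # Column count is not a safe continuation test, but a mismatch
        # certainly means the tables cannot be stacked as they are.
        if rows and len(rows[0]) != len(columns):
            raise ValueError("column count differs; not a plain continuation")
        frames.append(pd.DataFrame(rows, columns=columns))
    return pd.concat(frames, ignore_index=True)


def extract_first_tables(pdf_path):
    """Collect the rows of the first table found on each page (PyMuPDF)."""
    import pymupdf  # imported lazily so the join helper works without it

    tables = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            found = page.find_tables().tables
            if found:
                tables.append(found[0].extract())
    return tables
```

For a real document the call would be `join_continued_tables(extract_first_tables("report.pdf"))`, where the file name is a placeholder. The join step can also be tried on literal data, for example `join_continued_tables([[["id", "name"], ["1", "a"]], [["2", "b"], ["3", "c"]]])`, which yields one three-row DataFrame with the columns `id` and `name`.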
