pymupdf4llm for multi-page table #3954
Hello, I've been trying to find a PDF parser tool that handles tables that start and end on different pages, without redeclaring columns. Does pymupdf4llm handle this scenario?
Replies: 1 comment
No, we don't. This is a request that exceeds syntactical extraction logic. We are currently producing MD text page by page.
If you know that your tables are continuations, you can still join them by exporting each one to a pandas DataFrame and then joining them with pandas. There is an example script in the utilities repo.
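For illustration, here is a minimal sketch of that manual approach. It is not the example script from the utilities repo. The helper names `tables_per_page` and `join_continued_tables` and the `header_repeated` flag are made up for this post, and it assumes a recent PyMuPDF where `import pymupdf` works and where `page.find_tables()` and `Table.to_pandas()` are available. You still have to decide yourself which tables belong together; the code only checks that the column counts match.

```python
import pandas as pd


def tables_per_page(pdf_path):
    """Return one DataFrame per table found, in page order."""
    import pymupdf  # PyMuPDF, imported lazily so the join helper works without it

    frames = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            for table in page.find_tables().tables:
                frames.append(table.to_pandas())
    return frames


def join_continued_tables(frames, header_repeated=True):
    """Concatenate DataFrames that the caller KNOWS are parts of one table.

    header_repeated (hypothetical flag):
      True  -> every part has its own header row, which is simply dropped
      False -> continuation parts have no header, so their column labels
               are assumed to be the first data row and are put back as data
    """
    if not frames:
        return pd.DataFrame()

    columns = list(frames[0].columns)
    parts = [frames[0]]
    for df in frames[1:]:
        if df.shape[1] != len(columns):
            raise ValueError("column count differs, probably not a continuation")
        body = df.copy()
        body.columns = columns  # use the labels of the first part throughout
        if not header_repeated:
            # Restore the misread "header" of this part as a regular row.
            first_row = pd.DataFrame([list(df.columns)], columns=columns)
            body = pd.concat([first_row, body], ignore_index=True)
        parts.append(body)
    return pd.concat(parts, ignore_index=True)
```

Usage would be something like `join_continued_tables(tables_per_page("input.pdf"))`, after filtering the list down to the tables you know are continuations. With `header_repeated=False` the restored rows are strings, so you may need to convert column types afterwards.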
There is no effort to detect things crossing multiple pages. This applies not only to tables but also, for example, to text paragraphs.
Detecting that a table on some page actually continues the last table on the previous page would turn the existing (page-wise) logic on its head. In addition, if a table has no header row, how would we even ensure that it continues an earlier table: