Skip to content

Latest commit

 

History

History
22 lines (13 loc) · 1.14 KB

README.md

File metadata and controls

22 lines (13 loc) · 1.14 KB

Repo for PDF parsing pipeline

Suggested methodology: estimate presence of scientific content (LaTeX) or tables on the pdf page (as these are the major reasons why simple parsers might result in a corrupted data), and route the parsing to the appropriate tool.

Suggested tools for simple layouts:

  • PyMuPDF, etc. (40 ms processing per page)

Suggested tools for complex layouts:

Currently there are two ways implemented to estimate complexity of the pdf document

  • LLM based (based on api from GPT-4 or Claude 3) (around 1 page per second)
  • Light-weight visual/textual models (trained for latex and table detection from visual/textual signal) (whole pipeline around 30 pages per second on GPU)

To perform llm-based inference consult notebooks/Example_llm.ipynb

For the light-weight, scalable annotation, download checkpoints of the models from https://drive.google.com/file/d/1cQCvW4JdETfO55zVvq6m5vEnDPaTwWzn/ and run script infer_structure.py