Repo for PDF parsing pipeline

PDF pipeline for creating training corpora, mainly for the LLM, multimodal, and alignment horizontals.

Suggested methodology: estimate the presence of scientific content (LaTeX) or tables on each PDF page (these are the main reasons why simple parsers can produce corrupted data), and route the parsing to the appropriate tool.

Suggested tools for simple layouts:

  • PyMuPDF, etc. (40 ms processing per page)
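As a minimal sketch of the simple-layout path, the snippet below extracts plain text with PyMuPDF; the file name is a placeholder and the routing decision is assumed to have already been made.

```python
# Minimal sketch of the simple-layout path: plain-text extraction with PyMuPDF.
# "document.pdf" is a placeholder; error handling is omitted.
import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
pages = [page.get_text("text") for page in doc]  # one string per page
doc.close()

print(pages[0][:500])  # first 500 characters of page 1
```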

Suggested tools for complex layouts:

There are currently two implemented ways to estimate the complexity of a PDF document:

  • LLM-based (using the GPT-4 or Claude 3 API; around 1 page per second)
  • Lightweight visual/textual models (trained to detect LaTeX and tables from visual/textual signals; the whole pipeline runs at around 30 pages per second on a GPU)
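The routing itself can be expressed as a small wrapper around either estimator. In the sketch below, `estimate_complexity` stands in for one of the two estimators above and `parse_complex_page` for whatever heavyweight parser is chosen for complex layouts; both names are hypothetical placeholders, not part of this repository.

```python
# Hypothetical routing sketch: send simple pages to PyMuPDF and complex pages
# (LaTeX/table-heavy) to a heavier parser. estimate_complexity() and
# parse_complex_page() are placeholders for the estimators/tools described above.
import fitz  # PyMuPDF


def estimate_complexity(page: "fitz.Page") -> bool:
    """Placeholder: return True if the page likely contains LaTeX or tables."""
    raise NotImplementedError


def parse_complex_page(page: "fitz.Page") -> str:
    """Placeholder for a heavier parser used on complex layouts."""
    raise NotImplementedError


def parse_pdf(path: str) -> list[str]:
    texts = []
    with fitz.open(path) as doc:
        for page in doc:
            if estimate_complexity(page):
                texts.append(parse_complex_page(page))
            else:
                texts.append(page.get_text("text"))
    return texts
```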

To perform LLM-based inference, consult notebooks/Example_llm.ipynb.
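For orientation only, here is a hedged sketch of what LLM-based page classification can look like with the OpenAI API; the actual prompt, model name, and output handling used in notebooks/Example_llm.ipynb may differ.

```python
# Hedged sketch of LLM-based page classification (not the notebook's exact code).
# The prompt, model name ("gpt-4o"), and response handling are assumptions.
import base64

import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_page(pdf_path: str, page_number: int) -> str:
    # Render the page to PNG and base64-encode it for the vision API.
    with fitz.open(pdf_path) as doc:
        png_bytes = doc[page_number].get_pixmap(dpi=150).tobytes("png")
    image_b64 = base64.b64encode(png_bytes).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does this PDF page contain LaTeX-style math or tables? "
                         "Answer 'complex' or 'simple'."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower()
```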

For lightweight, scalable annotation, download the model checkpoints from https://drive.google.com/file/d/1cQCvW4JdETfO55zVvq6m5vEnDPaTwWzn/ and run the infer_structure.py script.
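If a scripted download is convenient, the checkpoints can be fetched with gdown, an external helper that is not a stated dependency of this repo; the output file name below is only a placeholder, and a manual download from the link above works just as well.

```python
# Optional convenience: fetch the checkpoint archive with gdown (external tool,
# not a stated dependency). "checkpoints.zip" is a placeholder output name.
import gdown

url = "https://drive.google.com/file/d/1cQCvW4JdETfO55zVvq6m5vEnDPaTwWzn/"
gdown.download(url, output="checkpoints.zip", fuzzy=True, quiet=False)
```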
