Repo for PDF parsing pipeline

PDF pipeline for creating training corpora, mainly for the LLM, multimodal, and alignment horizontals.

Suggested methodology: estimate the presence of scientific content (LaTeX) or tables on each PDF page (these are the main reasons why simple parsers can produce corrupted data), and route the parsing to the appropriate tool.

Suggested tools for simple layouts:

  • PyMuPDF, etc. (40 ms processing per page)
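As a minimal sketch of the simple-layout path, the snippet below extracts plain text with PyMuPDF; the file name is a placeholder and the routing decision is assumed to have already been made.

```python
# Minimal sketch of the simple-layout path: plain-text extraction with PyMuPDF.
# "document.pdf" is a placeholder; error handling is omitted.
import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
pages = [page.get_text("text") for page in doc]  # one string per page
doc.close()

print(pages[0][:500])  # first 500 characters of page 1
```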

Suggested tools for complex layouts:

There are currently two implemented ways to estimate the complexity of a PDF document:

  • LLM-based (using the GPT-4 or Claude 3 API; around 1 page per second)
  • Lightweight visual/textual models (trained to detect LaTeX and tables from visual/textual signals; the whole pipeline runs at around 30 pages per second on a GPU)
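The routing itself can be expressed as a small wrapper around either estimator. In the sketch below, `estimate_complexity` stands in for one of the two estimators above and `parse_complex_page` for whatever heavyweight parser is chosen for complex layouts; both names are hypothetical placeholders, not part of this repository.

```python
# Hypothetical routing sketch: send simple pages to PyMuPDF and complex pages
# (LaTeX/table-heavy) to a heavier parser. estimate_complexity() and
# parse_complex_page() are placeholders for the estimators/tools described above.
import fitz  # PyMuPDF


def estimate_complexity(page: "fitz.Page") -> bool:
    """Placeholder: return True if the page likely contains LaTeX or tables."""
    raise NotImplementedError


def parse_complex_page(page: "fitz.Page") -> str:
    """Placeholder for a heavier parser used on complex layouts."""
    raise NotImplementedError


def parse_pdf(path: str) -> list[str]:
    texts = []
    with fitz.open(path) as doc:
        for page in doc:
            if estimate_complexity(page):
                texts.append(parse_complex_page(page))
            else:
                texts.append(page.get_text("text"))
    return texts
```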

To perform LLM-based inference, consult notebooks/Example_llm.ipynb.
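For orientation only, here is a hedged sketch of what LLM-based page classification can look like with the OpenAI API; the actual prompt, model name, and output handling used in notebooks/Example_llm.ipynb may differ.

```python
# Hedged sketch of LLM-based page classification (not the notebook's exact code).
# The prompt, model name ("gpt-4o"), and response handling are assumptions.
import base64

import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_page(pdf_path: str, page_number: int) -> str:
    # Render the page to PNG and base64-encode it for the vision API.
    with fitz.open(pdf_path) as doc:
        png_bytes = doc[page_number].get_pixmap(dpi=150).tobytes("png")
    image_b64 = base64.b64encode(png_bytes).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does this PDF page contain LaTeX-style math or tables? "
                         "Answer 'complex' or 'simple'."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower()
```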

For lightweight, scalable annotation, download the model checkpoints from https://drive.google.com/file/d/1cQCvW4JdETfO55zVvq6m5vEnDPaTwWzn/ and run the infer_structure.py script.
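If a scripted download is convenient, the checkpoints can be fetched with gdown, an external helper that is not a stated dependency of this repo; the output file name below is only a placeholder, and a manual download from the link above works just as well.

```python
# Optional convenience: fetch the checkpoint archive with gdown (external tool,
# not a stated dependency). "checkpoints.zip" is a placeholder output name.
import gdown

url = "https://drive.google.com/file/d/1cQCvW4JdETfO55zVvq6m5vEnDPaTwWzn/"
gdown.download(url, output="checkpoints.zip", fuzzy=True, quiet=False)
```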
