Change Log

Changes in version 0.0.15

Fixes:

138 - Table is not extracted and some text order was wrong.
135 - Problem with multiple columns in simple text.
134 - Exclude images based on size threshold parameter.
132 - Optionally embed images as base64 string.
128 - Enhanced image embedding format.

Improvements

New parameter embed_images (bool) embeds images and vector graphics in the markdown text as base64-encoded strings. Ignores write_images and image_path parameters.
New parameter image_size_limit which is a float between 0 and 1, default is 0.05 (5%). Causes images to be ignored if their width or height values are smaller than the corresponding fraction of the page's width or height.
The algorithm has been improved which determins the sequence of the text rectangles on multi-column pages.
Change of the header identification algorithm: If more than six header levels are required for a document, then all text with a font size larger than body text is assumed to be a header of level 6 (i.e. HTML "h6" = "###### ").

Changes in version 0.0.13

Fixes

112 - Invalid bandwriter header dimensions/setup.

Improvements

New parameter ignore_code suppresses special formatting of text in mono-spaced fonts.
New parameter extract_words enforces page_chunks=True and adds a "words" list to each page dictionary.

Changes in version 0.0.11

Fixes

90 - 'Quad' object has no attribute 'tl'.
88 - Bug in is_significant function.

Improvements

Extended the list of known bullet point characters.

Changes in version 0.0.10

Fixes

73 - bug in to_markdown internal function.
74 - minimum area for images & vector graphics.
75 - Poor Markdown Generation for Particular PDF.
76 - suggestion on useful api parameters.

Improvements

Improved recognition of "insignificant" vector graphics. Graphics like text highlights or borders will be ignored.
The format of saved images can now be controlled via new parameter image_format.
Images can be stored in a specific folder via the new parameter image_path.
Images are not stored if contained in another image on same page.
Images are not stored if too small: if width or height are less than 5% of corresponding page dimension.
All text is always written. If write_images=True, text on images / graphics can be suppressed by setting force_text=False.

Changes in version 0.0.9

Fixes

71 - Unexpected results in pymupdf4llm but pymupdf works.
68 - Issue with text extraction near footer of page.

Improvements

Improved identification of scattered text span particles. This should address most issues with out-of-sequence situations.
We now correctly process rotated pages (see issue 68).

Changes in version 0.0.8

Fixes

65 - Fix typo in pymupdf_rag.py.

Changes in version 0.0.7

Fixes

54 - Mistakes in orchestrating sentences. Additional fix: text extraction no longer uses the TEXT_DEHYPHENATE flag bit.

Improvements

Improved the algorithm dealing with vector graphics. Vector graphics are now more reliably classified as irrelevant: We now detect when "strokes" only exist in the neighborhood of the graphics boundary box border itself. This is quite often the case for code snippets.

Changes in version 0.0.6

Fixes

55 - Bug in helpers/multi_column.py - IndexError: list index out of range.
54 - Mistakes in orchestrating sentences.
52 - Chunking of text files.
Partial fix for 41 / 40 - Improved page column detection, but still no silver bullet for overly complex page layouts.

Improvements

New parameter dpi to specify the resolution of images.
New parameters page_width / page_height for easily processing reflowable documents (Text, Office, e-books).
New parameter graphics_limit to avoid spending runtimes for value-less content.
New parameter table_strategy to directly control the table detection strategy.