- 138 - Table is not extracted and some text order was wrong.
- 135 - Problem with multiple columns in simple text.
- 134 - Exclude images based on size threshold parameter.
- 132 - Optionally embed images as base64 string.
- 128 - Enhanced image embedding format.
- New parameter `embed_images` (bool) embeds images and vector graphics in the markdown text as base64-encoded strings. Ignores the `write_images` and `image_path` parameters.
- New parameter `image_size_limit`, a float between 0 and 1, default 0.05 (5%). Causes images to be ignored if their width or height is smaller than the corresponding fraction of the page's width or height.
- Improved the algorithm that determines the sequence of the text rectangles on multi-column pages.
- Change of the header identification algorithm: If more than six header levels are required for a document, then all text with a font size larger than body text is assumed to be a header of level 6 (i.e. HTML "h6" = "###### ").
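
The header rule above can be pictured with a small sketch. It shows one reading of the rule: font sizes larger than body text are ranked from largest to smallest, and any rank beyond six collapses to level 6. The function names and the ranking scheme are illustrative assumptions, not the library's internal code.

```python
def header_prefix(level: int) -> str:
    """Return the Markdown prefix for a header level, capped at level 6."""
    capped = min(level, 6)
    return "#" * capped + " "


def prefixes_for_sizes(font_sizes, body_size):
    """Map each font size larger than body text to a header prefix.

    Sizes are ranked from largest (level 1) downwards. Ranks beyond 6
    all receive the level 6 prefix. This ranking is an assumption made
    for illustration only.
    """
    larger = sorted({s for s in font_sizes if s > body_size}, reverse=True)
    return {size: header_prefix(rank) for rank, size in enumerate(larger, start=1)}
```

With eight distinct sizes above body text, the two smallest both map to `"###### "`, while body-sized text gets no prefix at all.
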
- 112 - Invalid bandwriter header dimensions/setup.
- New parameter `ignore_code` suppresses special formatting of text in mono-spaced fonts.
- New parameter `extract_words` enforces `page_chunks=True` and adds a "words" list to each page dictionary.
- Extended the list of known bullet point characters.
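
As a rough usage sketch for the two new parameters: the call below passes `extract_words` and `ignore_code` to `to_markdown` and then reads the "words" list of each page dictionary. The helper names `convert_with_words` and `words_per_page` are hypothetical, and the content of each word entry is deliberately not inspected here.

```python
def words_per_page(chunks):
    """Count the entries in the "words" list of each page dictionary."""
    return [len(chunk.get("words", [])) for chunk in chunks]


def convert_with_words(pdf_path):
    """Convert a PDF to page chunks that carry a "words" list."""
    # Imported lazily so that words_per_page can be used without the library.
    import pymupdf4llm

    # extract_words=True enforces page_chunks=True, so a list of page
    # dictionaries is expected instead of a single string.
    return pymupdf4llm.to_markdown(pdf_path, extract_words=True, ignore_code=True)
```

A typical call would be `words_per_page(convert_with_words("input.pdf"))`, where `input.pdf` stands for any local file.
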
- 73 - bug in `to_markdown` internal function.
- 74 - minimum area for images & vector graphics.
- 75 - Poor Markdown Generation for Particular PDF.
- 76 - suggestion on useful api parameters.
- Improved recognition of "insignificant" vector graphics. Graphics like text highlights or borders will be ignored.
- The format of saved images can now be controlled via the new parameter `image_format`.
- Images can be stored in a specific folder via the new parameter `image_path`.
- Images are not stored if contained in another image on same page.
- Images are not stored if too small: if width or height are less than 5% of corresponding page dimension.
- All text is always written. If `write_images=True`, text on images / graphics can be suppressed by setting `force_text=False`.
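
The two image storage rules above (too small, or contained in another image) can be sketched as a simple filter over bounding boxes. This is a minimal illustration under the assumption that images are given as `(x0, y0, x1, y1)` rectangles; the function names are hypothetical and the library's actual implementation may differ.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def too_small(img: Rect, page: Rect, limit: float = 0.05) -> bool:
    """True if image width or height is below `limit` times the page dimension."""
    img_w, img_h = img[2] - img[0], img[3] - img[1]
    page_w, page_h = page[2] - page[0], page[3] - page[1]
    return img_w < limit * page_w or img_h < limit * page_h


def contains(outer: Rect, inner: Rect) -> bool:
    """True if `inner` lies completely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def images_to_store(images: List[Rect], page: Rect, limit: float = 0.05) -> List[Rect]:
    """Keep images that are large enough and not inside a different image."""
    kept = []
    for img in images:
        if too_small(img, page, limit):
            continue
        if any(other != img and contains(other, img) for other in images):
            continue
        kept.append(img)
    return kept
```
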
- 71 - Unexpected results in pymupdf4llm but pymupdf works.
- 68 - Issue with text extraction near footer of page.
- Improved identification of scattered text span particles. This should address most issues with out-of-sequence situations.
- We now correctly process rotated pages (see issue 68).
- 65 - Fix typo in `pymupdf_rag.py`.
- 54 - Mistakes in orchestrating sentences. Additional fix: text extraction no longer uses the `TEXT_DEHYPHENATE` flag bit.
- Improved the algorithm dealing with vector graphics. Vector graphics are now more reliably classified as irrelevant: We now detect when "strokes" only exist in the neighborhood of the graphics boundary box border itself. This is quite often the case for code snippets.
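
To illustrate the idea of strokes that only exist near the border of a graphic's box, here is a minimal sketch. It assumes strokes are straight line segments given as pairs of points and uses a hypothetical tolerance `tol`; it is not the library's actual classification code.

```python
def hugs_border(segment, bbox, tol=2.0):
    """True if both endpoints of a segment lie along the same edge of bbox."""
    (ax, ay), (bx, by) = segment
    x0, y0, x1, y1 = bbox
    return (
        (abs(ax - x0) <= tol and abs(bx - x0) <= tol)      # left edge
        or (abs(ax - x1) <= tol and abs(bx - x1) <= tol)   # right edge
        or (abs(ay - y0) <= tol and abs(by - y0) <= tol)   # top edge
        or (abs(ay - y1) <= tol and abs(by - y1) <= tol)   # bottom edge
    )


def only_border_strokes(strokes, bbox, tol=2.0):
    """True if there is at least one stroke and every stroke hugs the border."""
    return bool(strokes) and all(hugs_border(s, bbox, tol) for s in strokes)
```

A plain frame drawn around a code snippet would pass this check, while a frame with an additional diagonal line through the interior would not.
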
- 55 - Bug in helpers/multi_column.py - IndexError: list index out of range.
- 54 - Mistakes in orchestrating sentences.
- 52 - Chunking of text files.
- Partial fix for 41 / 40 - Improved page column detection, but still no silver bullet for overly complex page layouts.
- New parameter `dpi` to specify the resolution of images.
- New parameters `page_width` / `page_height` for easily processing reflowable documents (Text, Office, e-books).
- New parameter `graphics_limit` to avoid spending runtime on content of little value.
- New parameter `table_strategy` to directly control the table detection strategy.
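
A short sketch of how these parameters might be passed together. The parameter names come from the list above; the concrete values and the helper names `build_options` and `convert` are illustrative assumptions and should be checked against the installed version.

```python
def build_options(dpi=150, page_width=612, page_height=None,
                  graphics_limit=None, table_strategy="lines_strict"):
    """Collect keyword arguments for to_markdown, skipping unset values."""
    options = {
        "dpi": dpi,
        "page_width": page_width,
        "table_strategy": table_strategy,
    }
    if page_height is not None:
        options["page_height"] = page_height
    if graphics_limit is not None:
        options["graphics_limit"] = graphics_limit
    return options


def convert(path, **overrides):
    """Convert a document to Markdown with the options built above."""
    # Imported lazily so that build_options can be used without the library.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(path, **build_options(**overrides))
```

For example, `convert("book.epub", page_width=400, graphics_limit=5000)` would lay out a reflowable document on narrower pages and pass a graphics limit, where `book.epub` stands for any local file.
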