parameter labels in export_to_markdown() does not work #112
Comments
If you simply want to skip headers and footers, nothing should be needed, because that is the default behavior. It is indeed true that the labels filter is currently applied only to "simple text", and other structures are included regardless. We should fix it. If you are looking for a way to discard the more complex structures (figures, tables, etc.), you can also use the parameter strict_text=True.
hey,
I am processing many clinical trial reports in PDF format. All page numbers
and some footers and headers are still there with the default settings.
It appears that some footers and headers are removed but not all.
I would love to share an example report but sadly I cannot.
On Tue, 17 Dec 2024 at 07:55, Michele Dolfi wrote:
If you simply want to skip headers and footers, nothing should be needed,
because it is the default behavior.
It is indeed true that the labels filter is currently used only for
"simple text", and other structures are included. We should fix it.
If you are looking for a way to discard the more complex structures (figures,
tables, etc.), you can also use the parameter strict_text=True.
Here is a small fix for it: #113. On the other hand, the case you describe looks more like a false prediction of the layout model, which is responsible for identifying the page footers and headers.
I want to export selected parts from the docling document to markdown. (Titles and paragraphs, but NO footers, headers, ...)
I wanted to do this by calling
doc.export_to_markdown(labels = {"title","paragraph"})
But this does not work: e.g., tables are still returned, but paragraphs are not.
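To make the expectation concrete, here is a minimal standalone sketch of the two behaviours discussed in this thread. It does not use docling at all: the `Item` class, the `export_items` function, the `filter_all` flag and the sample items are hypothetical stand-ins invented for illustration, and the real implementation may differ.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a parsed document item; not a docling class.
@dataclass
class Item:
    label: str                   # e.g. "title", "paragraph", "table"
    text: str
    is_simple_text: bool = True  # False for tables, figures, etc.


def export_items(items, labels, filter_all=True):
    """Join the text of the items whose label is in `labels`.

    With filter_all=False the label check is applied only to simple text
    items, so complex structures always pass through (the behaviour
    described in this issue). With filter_all=True every item is checked.
    """
    kept = []
    for item in items:
        if item.is_simple_text or filter_all:
            if item.label not in labels:
                continue
        kept.append(item.text)
    return "\n\n".join(kept)


# Made-up sample content, only for demonstration.
items = [
    Item("page_header", "Sample header"),
    Item("title", "# Sample title"),
    Item("paragraph", "Sample body text."),
    Item("table", "| a | b |", is_simple_text=False),
    Item("page_footer", "Page 1"),
]

wanted = {"title", "paragraph"}
reported = export_items(items, wanted, filter_all=False)  # table leaks through
expected = export_items(items, wanted, filter_all=True)   # only title + paragraph
```

In this sketch, `reported` still contains the table row even though "table" is not among the requested labels, while `expected` contains only the title and the paragraph, which is what I would expect from the labels parameter.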