Pass more arguments to pdftotext #66
Comments
This library uses the poppler cpp interface, so you would first have to check if it exposes the functionality you desire. |
But the underlying pdftotext is from http://poppler.freedesktop.org! It is a different pdftotext engine! Options from there:
But: I tested it with a PDF from my bank, to read old transactions. The option You can of course call the XpdfReader from Python, but then you would not need https://pypi.org/project/pdftotext/. As jalan wrote, we have to look at the poppler interface. There we find: https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/cpp/poppler-page.cpp#L282 This could be a starting point. |
@jalan Hi, I'm using this open issue to suggest a few additions that would considerably widen the usage scope of this library. pdftotext (poppler) does seem to expose the following parameters, at least via the command line:
Usage: pdftotext [options] <PDF-file> [<text-file>]
-f <int> : first page to convert
-l <int> : last page to convert
-r <fp> : resolution, in DPI (default is 72)
-x <int> : x-coordinate of the crop area top left corner
-y <int> : y-coordinate of the crop area top left corner
-W <int> : width of crop area in pixels (default is 0)
-H <int> : height of crop area in pixels (default is 0)
-layout : maintain original physical layout
-fixed <fp> : assume fixed-pitch (or tabular) text
-raw : keep strings in content stream order
-nodiag : discard diagonal text
-htmlmeta : generate a simple HTML file, including the meta information
-enc <string> : output text encoding name
-listenc : list available encodings
-eol <string> : output end-of-line convention (unix, dos, or mac)
-nopgbrk : don't insert page breaks between pages
-bbox : output bounding box for each word and page size to html. Sets -htmlmeta
-bbox-layout : like -bbox but with extra layout bounding box data. Sets -htmlmeta
-cropbox : use the crop box rather than media box
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
The interesting ones that are missing but would be very helpful are:
I believe this would be the equivalent via command line of NOT setting either
Here is the ...
<page width="595.000000" height="841.000000">
...
<flow>
<block xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
<line xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
<word xMin="17.281000" yMin="276.220000" xMax="65.249000" yMax="285.156000">Blabla</word>
<word xMin="67.465000" yMin="276.220000" xMax="76.793000" yMax="285.156000">blaa</word>
<word xMin="79.009000" yMin="276.220000" xMax="127.441000" yMax="285.156000">balbla</word>
</line>
</block>
</flow>
This could certainly be imported in Python via a list of tuples, something like this:
from typing import NamedTuple

class WordBox(NamedTuple):
    # The bounding-box attributes in the output are decimal values, hence float
    x0: float
    y0: float
    x1: float
    y1: float
    word: str
    flow: int  # dunno what flow really is however
    block: int  # would index blocks in the order they appear. Each word belongs to a block
    line: int  # same for lines
Ideally, a page object in this case could contain some meta-info about the page (such as dimensions and page number) and the possibility to extract the list of words and their bounding box. I can certainly extract all this info myself by calling |
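The flat list of tuples proposed above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the input is a single well-formed `<page>` fragment like the one quoted in the comment (real `-bbox-layout` output wraps pages in a larger document), and the helper names `local`, `children`, `words_from_page` and `SAMPLE` are hypothetical, not part of this library or of poppler.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class WordBox(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    word: str
    flow: int   # index of the enclosing <flow> within the page
    block: int  # running index of the enclosing <block> within the page
    line: int   # running index of the enclosing <line> within the page


def local(tag: str) -> str:
    """Drop an XML namespace prefix such as '{http://...}word', if present."""
    return tag.rsplit("}", 1)[-1]


def children(elem: ET.Element, name: str) -> list:
    """Direct children of elem whose local tag name equals name."""
    return [c for c in elem if local(c.tag) == name]


def words_from_page(page: ET.Element) -> list:
    """Flatten one <page> element into a list of WordBox tuples."""
    out = []
    i_block = i_line = -1
    for i_flow, flow in enumerate(children(page, "flow")):
        for block in children(flow, "block"):
            i_block += 1
            for line in children(block, "line"):
                i_line += 1
                for w in children(line, "word"):
                    out.append(WordBox(
                        float(w.get("xMin")), float(w.get("yMin")),
                        float(w.get("xMax")), float(w.get("yMax")),
                        w.text or "", i_flow, i_block, i_line))
    return out


# Hypothetical sample, shaped like the fragment quoted in the comment above
SAMPLE = """<page width="595.000000" height="841.000000">
  <flow>
    <block xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
      <line xMin="17.281000" yMin="276.220000" xMax="127.441000" yMax="285.156000">
        <word xMin="17.281000" yMin="276.220000" xMax="65.249000" yMax="285.156000">Blabla</word>
        <word xMin="67.465000" yMin="276.220000" xMax="76.793000" yMax="285.156000">blaa</word>
      </line>
    </block>
  </flow>
</page>"""

words = words_from_page(ET.fromstring(SAMPLE))
```

A flat list keeps the structure easy to filter or sort by coordinates, while the `flow`, `block` and `line` indices still let you regroup words when needed.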
I have been meaning to fix the layout regarding |
Awesome! Thanks for your work on this library. Just to give an example of how I parse the output:
from bs4 import BeautifulSoup  # features="lxml" also requires the lxml package

# PageWordBox, FlowWordBox, BlockWordBox, LineWordBox, WordBox and Rectangle
# are container classes that are not shown here.
def pdftotext_bbox_parse(content_box: str) -> list[PageWordBox]:
    """
    Given output of `pdftotext -bbox-layout`, parse & retrieve positional information.
    Parses the following kind of output from pdftotext:
    <head>
    </head>
    <body>
    <doc>
      <page width="595.000000" height="841.000000">
        <flow>
          <block xMin="277.060000" yMin="1.890400" xMax="520.033120" yMax="17.439040">
            <line xMin="277.060000" yMin="1.890400" xMax="520.033120" yMax="17.439040">
              <word xMin="277.060000" yMin="1.890400" xMax="361.276000" yMax="17.439040">Blablaa</word>
              <word xMin="365.380000" yMin="1.890400" xMax="392.454400" yMax="17.439040">blaaa</word>
              <word xMin="396.340000" yMin="1.890400" xMax="427.228480" yMax="17.439040">blaa</word>
              <word xMin="431.140000" yMin="1.890400" xMax="520.033120" yMax="17.439040">blaaaahhh</word>
            </line>
          </block>
          <block xMin="187.540000" yMin="17.969400" xMax="580.148800" yMax="54.176040">
            <line xMin="187.540000" yMin="17.969400" xMax="580.148800" yMax="26.816040">
    ...
    """
    soup = BeautifulSoup(content_box, features="lxml")
    pages: list[PageWordBox] = []
    # Running indices; they keep counting across pages rather than resetting
    idx_page, idx_flow, idx_block, idx_line, idx_word = -1, -1, -1, -1, -1
    for cur_page in soup.find_all("page"):
        idx_page += 1
        page = PageWordBox(n=idx_page, dim=cur_page.attrs)
        pages.append(page)
        for cur_flow in cur_page.find_all("flow"):
            idx_flow += 1
            flow = FlowWordBox(n=idx_flow, page=idx_page)
            page.flows.append(flow)
            for cur_block in cur_flow.find_all("block"):
                idx_block += 1
                block = BlockWordBox(n=idx_block,
                                     flow=idx_flow,
                                     box=Rectangle(*(float(n) for n in cur_block.attrs.values())))
                flow.blocks.append(block)
                page.blocks.append(block)
                for cur_line in cur_block.find_all("line"):
                    idx_line += 1
                    line = LineWordBox(n=idx_line,
                                       block=idx_block,
                                       box=Rectangle(*(float(n) for n in cur_line.attrs.values())))
                    block.lines.append(line)
                    flow.lines.append(line)
                    page.lines.append(line)
                    for cur_word in cur_line.find_all("word"):
                        idx_word += 1
                        # Positional arguments are xMin, yMin, xMax, yMax in attribute order
                        word = WordBox(
                            *(float(n) for n in cur_word.attrs.values()),
                            s=cur_word.text,
                            flow=idx_flow,
                            block=idx_block,
                            line=idx_line,
                            n=idx_word)
                        line.words.append(word)
                        block.words.append(word)
                        flow.words.append(word)
                        page.words.append(word)
    return pages
|
I have created #83 to track fixing the layout options. This issue can remain to discuss adding any other options. I am not likely to add more options, as I just want a fast and easy way to get all the text from a PDF, such as for text mining or searching. There are plenty of more featureful PDF libs out there for doing fancier things, like pdfminer, pypdf2, pymupdf, pikepdf, and probably more since last I checked. |
Understood, thanks. I'm actually already using PyMuPDF and it's great, but it seems to lack the layout-related options in pdftotext, so for me they complement each other. In case anyone needs it, here is how I read pdftotext output when handled via command line:
import subprocess
import tempfile
from pathlib import Path

def pdftotext_cli(path: str | Path, page_num: int | None = None, args: list[str] | None = None) -> str:
    """
    Example usage -> read the second page of PDF and return `-bbox-layout` information
    >>> pdftotext_cli(Path("/path/to/file.pdf"), page_num=2, args=["-bbox-layout"])
    """
    if isinstance(path, str):
        path = Path(path)
    # Reject missing paths and directories
    if not path.is_file():
        raise RuntimeError(f"Given path not a (pdf) file: {path!r}")
    # Restrict the conversion to a single page when page_num is given
    page_arg = ["-f", str(page_num), "-l", str(page_num)] if page_num else []
    args = args or []
    with tempfile.NamedTemporaryFile() as temp_file:
        # pdftotext writes its output into the temporary file, which is read back below
        subprocess.run(["pdftotext", str(path.absolute()), temp_file.name,
                        *page_arg,
                        *args],
                       check=True)
        content = temp_file.read().decode()
    return content
|
I would like the "nodiag" option (and "layout", which is already implemented) in the pdftotext library.
Usage: pdftotext [options] <PDF-file> [<text-file>]
It seems Poppler already provides this feature: TextOutputDev.cc |
I might be late, but you guys can install pyxpdf (pip install pyxpdf) for all possible arguments provided by xpdf. |
I am not sure whether it actually makes sense to have such a large request lying around here asking about lots of options, while it is not really clear which are actually available already. Wouldn't it make more sense to track the relevant and missing parts in dedicated, smaller issues? |
Hi @jalan, is it possible to retrieve only some pages of the PDF? I don't want to retrieve everything and then filter only the pages that I want. I would like to optimize that. Can you tell me if there is a way to do this, please? |
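One possible workaround for the page-range question, sketched with the command-line tool rather than with this library: the `-f` and `-l` flags from the usage listing earlier in this thread limit the conversion to a page range, and passing `-` as the text file makes pdftotext write to stdout, so no temporary file is needed. The function names `build_cmd` and `extract_pages` are hypothetical, and the sketch assumes the poppler `pdftotext` binary is on PATH. It does not say anything about what the Python binding itself can do.

```python
import subprocess
from pathlib import Path
from typing import Optional, Sequence, Union


def build_cmd(pdf: Union[str, Path], first: int, last: int,
              extra: Optional[Sequence[str]] = None) -> list:
    """Build a pdftotext command for pages first..last (1-based, inclusive)."""
    if first < 1 or last < first:
        raise ValueError(f"invalid page range: {first}-{last}")
    # "-" as the output file asks pdftotext to write the text to stdout
    return ["pdftotext", "-f", str(first), "-l", str(last),
            *(extra or []), str(pdf), "-"]


def extract_pages(pdf: Union[str, Path], first: int, last: int,
                  extra: Optional[Sequence[str]] = None) -> str:
    """Run pdftotext and return the text of the selected pages only."""
    result = subprocess.run(build_cmd(pdf, first, last, extra),
                            check=True, capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `extract_pages("statement.pdf", 2, 3, ["-layout"])` would convert only pages 2 and 3 with the physical layout preserved; the file name here is just a placeholder.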
When already using pymupdf, there should be no need to run pdftotext afterwards in theory (and not even using a temporary file), as pymupdf has native support for this itself: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction |
Hi,
I compared pymupdf and pdftotext years ago and found that the text extracted by pdftotext was better than pymupdf's. That's why I have used only pdftotext for PDF text extraction ever since. But I will try again.
Thank you for your help.
Yasmina.
|
You can add this comment to the README as well, for future reference. |
@Ekran did you find a solution for setting the layout argument to true? I tried this. Unfortunately, I got TypeError: 'layout' is an invalid keyword argument for this func |
@ahmed-bhs pdf = pdftotext.PDF(f, physical=True) https://poppler.freedesktop.org/api/cpp/classpoppler_1_1page.html |
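For anyone landing here with the same TypeError: the keyword is `physical`, not `layout`, as the reply above shows. A minimal sketch of how that call might be wrapped follows. The helper name `extract_text` is hypothetical, and the check that rejects `physical` and `raw` together is this sketch's own precaution, based on the assumption that the two are alternative layout modes (poppler's cpp interface models the text layout as a single choice).

```python
def extract_text(path, *, physical=False, raw=False):
    """Return a list with one string per page of the PDF at `path`."""
    if physical and raw:
        # Assumption of this sketch: the two modes are alternatives
        raise ValueError("choose at most one of physical=True and raw=True")

    # Imported lazily so the argument check works without the package installed;
    # requires `pip install pdftotext`
    import pdftotext

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f, physical=physical, raw=raw)
        # Iterating over the PDF object yields the text of each page
        return [page for page in pdf]
```

Calling `extract_text("file.pdf", physical=True)` would then correspond to the `-layout` behaviour discussed in this thread; the file name is a placeholder.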
Yeah exactly, thank you so much @benjamin-awd |
First of all, thanks for the handy module!
I'd be interested in having access to more of the features offered by pdftotext/xpdf to tune the quality of the extracted text.
As far as I know it is not possible to pass arguments freely to pdftotext but there are a few hardcoded parameters (password, raw).
Would that be something you would be open to add?
I'm not fluent in C++ but it seems that I could get inspiration from the existing code to try to have my arguments in.
The parameters/options I'm most interested in are nodiag, lineprinter, linespacing and fixed. The full list can be found here: http://www.xpdfreader.com/pdftotext-man.html