Lines of text are sometimes split into two #4033
Replies: 3 comments 6 replies
-
You did not attach an example file, so I'm assuming it is the one opened in this script:

import pathlib
import pymupdf

doc = pymupdf.open("2024-06-18-6670a9f1447abe73af0e9179fda392cc.pdf")
page = doc[1]
text = page.get_text(sort=True)
pathlib.Path("print-lines.txt").write_text(text)

Output: print-lines.txt

To me this looks good.
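If the symptom is that one visual line comes back as two, a rough way to confirm it is to look for separate line entries whose bounding boxes overlap vertically. The following is only a minimal sketch on hand-made data shaped like `page.get_text("dict")["blocks"]`; the function name `find_split_lines` and the `min_overlap` threshold are hypothetical and not part of PyMuPDF.

```python
def find_split_lines(blocks, min_overlap=0.5):
    """Return pairs of line texts whose bboxes overlap vertically.

    `blocks` is assumed to have the shape of page.get_text("dict")["blocks"].
    `min_overlap` is the required overlap as a fraction of the shorter line height.
    """
    lines = []
    for block in blocks:
        if block.get("type") != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            text = "".join(span["text"] for span in line["spans"])
            lines.append((line["bbox"], text))

    pairs = []
    for i, (a, text_a) in enumerate(lines):
        for b, text_b in lines[i + 1:]:
            # bbox layout is (x0, y0, x1, y1); compare the y ranges only
            overlap = min(a[3], b[3]) - max(a[1], b[1])
            shorter = min(a[3] - a[1], b[3] - b[1])
            if shorter > 0 and overlap / shorter >= min_overlap:
                pairs.append((text_a, text_b))
    return pairs


# Hand-made sample: "Hello" and "world" sit on the same visual line.
sample_blocks = [
    {"type": 0, "lines": [
        {"bbox": (50, 100, 90, 112), "spans": [{"text": "Hello"}]},
        {"bbox": (95, 100.5, 140, 112.5), "spans": [{"text": "world"}]},
        {"bbox": (50, 120, 140, 132), "spans": [{"text": "Next line"}]},
    ]},
]
print(find_split_lines(sample_blocks))
```

On a real page, `sample_blocks` would be replaced by `page.get_text("dict")["blocks"]`. An empty result suggests the lines are not split at the extraction level.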
-
But maybe I am misinterpreting. Do you mean that table of elements thing?
-
We cannot deal with this without a reproducing file.
-
Original issue: #3653
My question:
Hi @JorjMcKie, I have reviewed your recovered-lines script. My question is how to use it: does it edit the PDF with the recovered lines, or do we need to call the recovered-lines function after reading the PDF? I need to recover the lines and make them a single block in the PDF, so that I can later save all the properties in a data frame. Below is the code that builds my data frame. Could you please help with saving the block text as recovered lines with HTML tags applied?
# Step 5: Group by block_id and concatenate HTML-formatted text
rows_with_html = []
for page_num, blocks in block_dict.items():
    for block in blocks:
        if block['type'] == 0:  # Only text blocks
            block_id = block['number']
            block_text = []  # Collect text for this block
            original_text = []
            for line in block['lines']:
                for span in line['spans']:
                    xmin, ymin, xmax, ymax = list(span['bbox'])
                    font_size = span['size']
                    text = span['text'].strip().replace('\n', '').replace('\r', '')
                    span_font = span['font']
                    color = span["color"]

# Create the final DataFrame
grouped_df = pd.DataFrame(rows_with_html, columns=['page_num', 'block_id', 'text', 'originalText'])
grouped_df.to_excel('test.xlsx')
return grouped_df
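One possible way to fill `rows_with_html` is to regroup the spans of each block into visual lines and wrap them in HTML tags before building the data frame. This is a minimal sketch, not the recovered-lines script from #3653: `span_to_html`, `blocks_to_rows`, the tolerance `tol`, and the sample data are all hypothetical. It assumes `block_dict` maps page numbers to blocks shaped like `page.get_text("dict")["blocks"]`, and it uses the italic (2) and bold (16) bits of `span["flags"]`. It works only on the extracted data and does not modify the PDF.

```python
from html import escape

# Bit values of span["flags"] for PyMuPDF text spans.
FLAG_ITALIC = 2
FLAG_BOLD = 16


def span_to_html(span):
    """Wrap one span's text in <i>/<b> tags according to its font flags."""
    text = escape(span["text"].strip())
    if not text:
        return ""
    if span["flags"] & FLAG_ITALIC:
        text = f"<i>{text}</i>"
    if span["flags"] & FLAG_BOLD:
        text = f"<b>{text}</b>"
    return text


def blocks_to_rows(block_dict, tol=3.0):
    """Build (page_num, block_id, html_text, original_text) rows per text block.

    Spans whose bbox bottoms differ by at most `tol` points are treated as
    belonging to the same visual line (an assumption; tune `tol` as needed).
    """
    rows = []
    for page_num, blocks in block_dict.items():
        for block in blocks:
            if block["type"] != 0:  # only text blocks
                continue
            spans = [s for line in block["lines"] for s in line["spans"]]
            spans.sort(key=lambda s: s["bbox"][3])  # top to bottom
            groups = []
            for s in spans:
                if groups and abs(s["bbox"][3] - groups[-1][0]["bbox"][3]) <= tol:
                    groups[-1].append(s)
                else:
                    groups.append([s])
            html_lines, plain_lines = [], []
            for group in groups:
                group.sort(key=lambda s: s["bbox"][0])  # left to right
                html_lines.append(" ".join(filter(None, map(span_to_html, group))))
                plain_lines.append(
                    " ".join(s["text"].strip() for s in group if s["text"].strip())
                )
            rows.append(
                (page_num, block["number"], "<br>".join(html_lines), "\n".join(plain_lines))
            )
    return rows


# Hand-made sample: two spans on one visual line, one on the next.
block_dict = {0: [{"type": 0, "number": 0, "lines": [
    {"spans": [{"text": "Total:", "flags": 16, "bbox": (50, 100, 90, 112)}]},
    {"spans": [{"text": "42 items", "flags": 0, "bbox": (95, 100.5, 150, 112.5)}]},
    {"spans": [{"text": "see note", "flags": 2, "bbox": (50, 120, 110, 132)}]},
]}]}
rows = blocks_to_rows(block_dict)
print(rows)
```

The resulting `rows` could then be passed to `pd.DataFrame(rows, columns=['page_num', 'block_id', 'text', 'originalText'])` as in the snippet above.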