Lines of text are sometimes split into two #4033

KrishnaGole · 2024-11-08T16:47:47Z

KrishnaGole
Nov 8, 2024

Original issue: #3653
My question:

Hi @JorjMcKie, I have reviewed your recover-lines script. My question is how to use it: does it edit the PDF so that it contains the recovered lines, or do we need to call the recover-lines function each time after reading the file? I need to recover the lines and make each of them a single block in the PDF, so that I can later save all the properties in a data frame. Below is the code that builds my data frame. Could you please help me save the block text as recovered lines with the HTML tags applied?

# Step 5: Group by block_id and concatenate HTML-formatted text
rows_with_html = []
for page_num, blocks in block_dict.items():
    for block in blocks:
        if block['type'] == 0:  # Only text blocks
            block_id = block['number']
            block_text = []  # Collect text for this block
            original_text = []
            for line in block['lines']:
                for span in line['spans']:
                    xmin, ymin, xmax, ymax = list(span['bbox'])
                    font_size = span['size']
                    text = span['text'].strip().replace('\n', '').replace('\r', '')
                    span_font = span['font']
                    color = span["color"]

                    is_upper = "uppercase" in span_font.lower()
                    is_bold = "bold" in span_font.lower()

                    # Validate and format color value
                    if isinstance(color, int):
                        font_color = f'#{color:06x}'  # Ensure it's a 6-digit hex
                    elif isinstance(color, tuple) and len(color) >= 3:
                        font_color = f'#{color[0]:02x}{color[1]:02x}{color[2]:02x}'
                    else:
                        font_color = '#000000'  # Fallback to black if invalid
                    # Validate color length (should be 7 characters including #)
                    if len(font_color) != 7 or not font_color.startswith('#'):
                        font_color = '#000000'  # Fallback to black if invalid

                    if text.replace(" ", "") != "":
                        original_text.append(text)
                        text = unidecode(text)
                        tag_for_text = tag.get(round(font_size), 'span')  # Default to 'span' if not found

                        if font_size > 14:
                            tag_for_text = 'h1'
                        elif is_bold and tag_for_text.startswith('h'):
                            tag_for_text = 'h2'
                        elif tag_for_text.startswith('h'):
                            tag_for_text = 'h3'

                        if is_bold:
                            # Apply <b> tags only if it's bold and not a heading
                            text = f"<b>{text}</b>"

                        # if is_upper:
                        #     text = f"<span style='text-transform:uppercase'>{text}</span>"
                        # Only execute if text is not None, not empty, and not whitespace
                        # if text and text.strip():
                        #     text_with_tag = f"<{tag_for_text} style='display:inline; color:{font_color};'>{text}</{tag_for_text}>\n"
                        #     block_text.append(text_with_tag)
                        if tag_for_text != 'p':
                            text_with_tag = f"<{tag_for_text} style='display:inline; color:{font_color};'>{text}</{tag_for_text}>\n"
                            block_text.append(text_with_tag)
                        else:
                            block_text.append(text)

            if not block_text or not block_text[0].startswith('<h'):
                rows_with_html.append((page_num, block_id, f"<p>{' '.join(block_text)}</p>", ' '.join(original_text)))
            else:
                rows_with_html.append((page_num, block_id, ' '.join(block_text) + "<p></p>", ' '.join(original_text)))
            # rows_with_html.append((page_num, block_id, ' '.join(block_text) + "<br><br>", ' '.join(original_text)))

# Create the final DataFrame

grouped_df = pd.DataFrame(rows_with_html, columns=['page_num', 'block_id', 'text', 'originalText'])
grouped_df.to_excel('test.xlsx')
return grouped_df

JorjMcKie · 2024-11-09T09:13:07Z

JorjMcKie
Nov 9, 2024
Maintainer

You did not attach an example file, so I'm assuming it is that 2024-06-18-6670a9f1447abe73af0e9179fda392cc.pdf named in the referenced other issue.
Here is the output of page 2 produced by

import pymupdf
import pathlib

doc = pymupdf.open("2024-06-18-6670a9f1447abe73af0e9179fda392cc.pdf")
page = doc[1]
text = page.get_text(sort=True)
pathlib.Path("print-lines.txt").write_text(text)

print-lines.txt - To me this looks good.

0 replies

JorjMcKie · 2024-11-09T09:15:11Z

JorjMcKie
Nov 9, 2024
Maintainer

But maybe I am misinterpreting. Do you mean that table of elements thing?
Anyway, this is no bug but a Discussions item.

1 reply

KrishnaGole Nov 9, 2024
Author

@JorjMcKie

I can't attach the PDF for security reasons, but I am attaching screenshots here.

The screenshot below shows the generated text, which is broken up vertically:

When I open the PDF in an online editor, I can see that the line is broken into multiple blocks.

I am trying to use the recover-lines method below, which you suggested in the other issue for split lines. It does recover the lines, but I don't know how to use the result, because it is just text. I also want the HTML tag along with the text, and I want to save everything as a data frame as mentioned in the discussion description:
"""
PyMuPDF Demo Script

The script addresses the frequent problem that a page's words are not present
in reading order, as it may happen when text has been added to a non-empty
page.
We read the words of a page and use their coordinates for recovering line
content.
"""

import sys
import pymupdf
from statistics import median


def recover_lines(page):
    """Reconstitute text lines on the page by using the coordinates of the
    single words.
    """
    # extract words, sorted by bottom, then left coordinate
    words = [
        (pymupdf.Rect(w[:4]), w[4]) for w in page.get_text("words", sort=True, flags=0)
    ]
    if not words:  # nothing to recover on a page without text
        return ""
    lines = []  # list of reconstituted lines
    line = [words[0]]  # current line
    lrect = words[0][0]  # the line's rectangle

    # walk through the remaining words (the first one already started the line)
    for wr, text in words[1:]:
        w0r, _ = line[-1]  # read previous word in current line

        # if this word matches top or bottom of the line, append it
        if abs(lrect.y0 - wr.y0) <= 3 or abs(lrect.y1 - wr.y1) <= 3:
            line.append((wr, text))
            lrect |= wr
        else:
            # output current line and re-initialize
            # note that we sort the words in current line first
            ltext = " ".join([w[1] for w in sorted(line, key=lambda w: w[0].x0)])
            lines.append((lrect, ltext))
            line = [(wr, text)]
            lrect = wr

    # also append last unfinished line
    ltext = " ".join([w[1] for w in sorted(line, key=lambda w: w[0].x0)])
    lines.append((lrect, ltext))

    # sort all lines vertically
    lines.sort(key=lambda l: (l[0].y1))

    # compute the middle value of line heights
    median_lheight = median([l[0].height for l in lines])

    text = lines[0][1]  # text of first line
    y1 = lines[0][0].y1  # its bottom coordinate
    for lrect, ltext in lines[1:]:
        distance = int(round((lrect.y0 - y1) / median_lheight))
        breaks = "\n" * (distance + 1)
        text += breaks + ltext
        y1 = lrect.y1

    # return page text
    return text


if __name__ == "__main__":
    filename = "WordingSpacedOutVertically.pdf"
    doc = pymupdf.open(filename)
    text = chr(12).join([recover_lines(page) for page in doc])
    print(text)

Below is an example of how a single line is broken into multiple blocks.
How can I bring this broken line into one block? I need to wrap each block in one HTML tag.
The column sequence is:
Page no, block no, text with HTML tag, text without HTML tag

JorjMcKie · 2024-11-09T15:06:14Z

JorjMcKie
Nov 9, 2024
Maintainer

We cannot deal with this without a reproducing file.
Did you try my example script? It should work.

5 replies

KrishnaGole Nov 10, 2024
Author

WordingSpacedOutVertically.pdf
@JorjMcKie Attached is the sample PDF. Yes, I have tried the example, and it prints the text without the vertical splitting. But when I look at the blocks, the words are still split across multiple blocks. I need to get the HTML tag for each block and print each block as one paragraph in HTML, so a recovered line should end up in a single block after recovering it with sort=True.
The column sequence is:
Page no, block no, text with HTML tag, text without HTML tag

JorjMcKie Nov 10, 2024
Maintainer

Well, I am not sure what you expect. The function Page.get_text(sort=True) works as asserted. If single words appear out of reading sequence originally, then that is caused by the PDF creator.
The function cleans this up and produces a perfect output in natural reading sequence.

The method uses the output of get_text("words", sort=True) and reformats it. You could / should start from here to make your desired result.
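
As a starting point, the grouping step can be sketched on the plain tuples that get_text("words", sort=True) returns, each of the form (x0, y0, x1, y1, word, block_no, line_no, word_no). The helper name group_words_into_lines and the 3-point tolerance are assumptions made for illustration, and unlike the demo script this variant compares each word against every line found so far. It takes no PyMuPDF objects, so it can be tried on hand-made data.

```python
def group_words_into_lines(words, tolerance=3.0):
    """Group word tuples into visual lines.

    `words` is assumed to have the shape of page.get_text("words", sort=True):
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    Returns a list of ((x0, y0, x1, y1), text) tuples, ordered top to bottom.
    """
    lines = []  # each entry: [x0, y0, x1, y1, [(x0, word), ...]]
    for x0, y0, x1, y1, word, *_ in words:
        for line in lines:
            # same line if the top or the bottom edges nearly coincide
            if abs(line[1] - y0) <= tolerance or abs(line[3] - y1) <= tolerance:
                line[0] = min(line[0], x0)
                line[1] = min(line[1], y0)
                line[2] = max(line[2], x1)
                line[3] = max(line[3], y1)
                line[4].append((x0, word))
                break
        else:
            # no existing line matched, so this word starts a new one
            lines.append([x0, y0, x1, y1, [(x0, word)]])

    lines.sort(key=lambda entry: entry[3])  # order lines by bottom coordinate
    result = []
    for lx0, ly0, lx1, ly1, items in lines:
        # words within a line are ordered left to right
        text = " ".join(w for _, w in sorted(items, key=lambda item: item[0]))
        result.append(((lx0, ly0, lx1, ly1), text))
    return result
```

Each returned entry carries the line's bounding box next to its text, so a page number and a running line number can be attached to build the rows of a data frame.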

KrishnaGole Nov 10, 2024
Author

@JorjMcKie Yup, got it. The only thing I want to know: get_text("words", sort=True) only gives me text with a \n separator. How can I get the HTML tag along with the natural reading sequence? If you can guide me to that, it will be really helpful.

JorjMcKie Nov 10, 2024
Maintainer

Hm, I am still having trouble understanding your problem. Of course you cannot expect \n to work properly within HTML source.
And BTW the "words" output is a list of tuples - no line break contained anywhere. To make HTML produce a line break, use the <br> tag.
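
A minimal sketch of that last point, assuming the recovered lines are already available as a list of strings: the lines are escaped and joined with <br> inside one element. The function name lines_to_html and the default "p" tag are hypothetical choices, not part of PyMuPDF.

```python
from html import escape


def lines_to_html(lines, tag="p"):
    """Wrap recovered text lines in one HTML element, separated by <br>."""
    # escape() keeps characters like < and & from being read as markup
    body = "<br>\n".join(escape(line) for line in lines)
    return f"<{tag}>{body}</{tag}>"
```

For a whole page, the input could be the result of page.get_text(sort=True).splitlines().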

KrishnaGole Nov 10, 2024
Author

Dear @JorjMcKie, currently I am using the code snippet below to get span['font'] and other font properties, and I use those font properties to generate the HTML:
for page_num, blocks in block_dict.items():
    for block in blocks:
        if block['type'] == 0:  # Only text blocks
            block_id = block['number']
            block_text = []  # Collect text for this block
            original_text = []
            font_color = ''
            for line in block['lines']:
                for span in line['spans']:
                    xmin, ymin, xmax, ymax = list(span['bbox'])
                    font_size = span['size']
                    text = span['text'].strip().replace('\n', '').replace('\r', '')
                    span_font = span['font']
                    color = span["color"]

                    is_upper = "uppercase" in span_font.lower()
                    is_bold = "bold" in span_font.lower()

Now, if I use Page.get_text(sort=True), is it possible to get the fonts as in the code above?
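
One possible direction, offered as a sketch and not as the maintainer's answer: the plain-text output is just a string and carries no font information, so the spans of the "dict" output can be regrouped by vertical position in the same spirit as the demo script regroups words. The function name spans_to_html_lines, the 3-point tolerance and the bold test on the font name are assumptions; the input is a plain dictionary shaped like page.get_text("dict").

```python
from html import escape


def spans_to_html_lines(page_dict, tolerance=3.0):
    """Regroup spans by vertical position and keep their font properties.

    `page_dict` is assumed to have the shape of page.get_text("dict"):
    blocks -> lines -> spans, each span with "bbox", "text", "font"
    and an integer sRGB "color".
    Returns a list of (html, plain_text) tuples, one per recovered line.
    """
    spans = [
        span
        for block in page_dict["blocks"]
        if block["type"] == 0  # text blocks only
        for line in block["lines"]
        for span in line["spans"]
        if span["text"].strip()  # skip whitespace-only spans
    ]
    # sort by bottom edge, then by left edge
    spans.sort(key=lambda s: (s["bbox"][3], s["bbox"][0]))

    lines = []  # each entry is a list of spans with nearly equal bottoms
    for span in spans:
        if lines and abs(lines[-1][-1]["bbox"][3] - span["bbox"][3]) <= tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])

    rows = []
    for line in lines:
        line.sort(key=lambda s: s["bbox"][0])  # left to right within the line
        html_parts = []
        for span in line:
            text = escape(span["text"].strip())
            if "bold" in span["font"].lower():
                text = f"<b>{text}</b>"
            html_parts.append(
                f"<span style='color:#{span['color']:06x}'>{text}</span>"
            )
        plain = " ".join(s["text"].strip() for s in line)
        rows.append((" ".join(html_parts), plain))
    return rows
```

Each row holds the HTML and the plain text of one recovered line; a page number and a line counter can be prepended to match the column sequence described earlier in the thread.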
