Improve Word to HTML conversion #65

kwheelan · 2024-05-28T00:49:54Z

Test a few different Word docs to see what needs to change
Create a more robust way to identify page numbers and page breaks
See if there's a way to detect larger font/bold text/center alignment without requiring the user to select Styles in the Word doc (maybe this will already work now that the text isn't in a table?)

lucakato · 2024-09-05T19:16:34Z

I coded a skeleton of functions we could use for this; it's in the 83 branch. The idea is to use python-docx.

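    from docx import Document
    from docx.enum.text import WD_ALIGN_PARAGRAPH

    def analyze_document_structure(self):
        doc = Document(self.docx_file)
        for paragraph in doc.paragraphs:
            style_font = paragraph.style.font
            runs = paragraph.runs

            # Check font size. Direct formatting is stored on runs, so look
            # there as well as on the paragraph style.
            sizes = [run.font.size for run in runs if run.font.size]
            if style_font.size:
                sizes.append(style_font.size)
            if any(size.pt > 12 for size in sizes):
                # At least part of this paragraph is larger text
                pass

            # Check bold (None means "inherited", so only True counts)
            if style_font.bold or any(run.bold for run in runs):
                # At least part of this paragraph is bold text
                pass

            # Check alignment, falling back to the style when it is inherited
            alignment = paragraph.alignment
            if alignment is None:
                alignment = paragraph.style.paragraph_format.alignment
            if alignment == WD_ALIGN_PARAGRAPH.CENTER:
                # This is center-aligned text
                pass

            # Check for page breaks: an explicit <w:br w:type="page"/> in a run,
            # or the "page break before" paragraph setting
            has_page_break = bool(paragraph._p.xpath('.//w:br[@w:type="page"]'))
            if has_page_break or paragraph.paragraph_format.page_break_before:
                # This paragraph contains or starts with a page break
                pass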

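    def identify_page_numbers(self):
        doc = Document(self.docx_file)
        for section in doc.sections:
            for paragraph in section.footer.paragraphs:
                # PAGE is a field code, so it never appears in paragraph.text.
                # Read the field instructions from the underlying XML instead.
                instructions = paragraph._p.xpath(
                    './/w:instrText/text() | .//w:fldSimple/@w:instr')
                if any(instr.split()[:1] == ['PAGE'] for instr in instructions):
                    # This section uses page numbers in the footer
                    pass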

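    def detect_formatting_changes(self):
        doc = Document(self.docx_file)
        previous_format = None
        for paragraph in doc.paragraphs:
            if not paragraph.runs:
                continue  # Skip empty paragraphs; they have no visible formatting
            # Prefer direct formatting on the first run, then fall back to the style
            run_font = paragraph.runs[0].font
            style_font = paragraph.style.font
            current_format = {
                'font': run_font.name or style_font.name,
                'size': run_font.size or style_font.size,
                'bold': style_font.bold if run_font.bold is None else run_font.bold,
                'italic': style_font.italic if run_font.italic is None else run_font.italic,
                'alignment': paragraph.alignment
            }
            if previous_format is not None and current_format != previous_format:
                # Format change detected
                pass
            previous_format = current_format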

kwheelan · 2024-09-09T17:53:47Z

@lucakato The idea seems solid. A few notes:

You won't be able to use python-docx exclusively, because I think only lxml (the current package) can parse the document tree to extract the Word comments (which hold the XBRL tags). But you should be able to use a hybrid approach, with python-docx identifying the formatting.
That means you'll have to write a function to map the Word formatting you identify to its location in the converted HTML.
You'll also have to write functions that actually add the appropriate HTML tags for the identified formatting.
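
As a rough illustration of that last step, here is a minimal sketch of turning identified formatting into HTML tags. The function names and the keys of the formatting dict (`bold`, `italic`, `centered`, `size_pt`) are hypothetical and not part of the existing converter; it also assumes formatting is tracked per paragraph, in document order.

```python
from html import escape


def wrap_with_formatting(text, fmt):
    """Wrap one paragraph's text in HTML for the detected formatting.

    `fmt` is a hypothetical dict of flags produced by the detection step.
    """
    html = escape(text)
    if fmt.get("bold"):
        html = f"<strong>{html}</strong>"
    if fmt.get("italic"):
        html = f"<em>{html}</em>"

    # Paragraph-level formatting becomes inline CSS on the <p> tag.
    styles = []
    if fmt.get("centered"):
        styles.append("text-align: center")
    if fmt.get("size_pt"):
        styles.append(f"font-size: {fmt['size_pt']}pt")
    joined = "; ".join(styles)
    style_attr = f' style="{joined}"' if styles else ""
    return f"<p{style_attr}>{html}</p>"


def paragraphs_to_html(paragraphs):
    """Convert a list of (text, fmt) pairs, in document order, to HTML."""
    return "\n".join(wrap_with_formatting(text, fmt) for text, fmt in paragraphs)
```

For example, `wrap_with_formatting("Total & net", {"bold": True, "centered": True})` escapes the ampersand and yields a centered paragraph with the text wrapped in `<strong>`. Mapping to the right location in the existing converted HTML would still need its own logic.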

My advice is to start with one formatting thing (like identifying page breaks or bold text) and test that before writing out the whole logic.
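
To make "start with one thing and test it" concrete, here is a small sketch that only detects explicit page breaks, working directly on the `word/document.xml` markup so it can be tested against a hand-written sample. It uses the standard library's ElementTree rather than lxml to stay self-contained; the function name and the sample XML are assumptions for illustration, and paragraphs nested inside tables are deliberately ignored.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_with_page_breaks(document_xml):
    """Return indices of top-level body paragraphs with an explicit page break."""
    root = ET.fromstring(document_xml)
    body = root.find(f"{{{W_NS}}}body")
    hits = []
    for index, para in enumerate(body.findall(f"{{{W_NS}}}p")):
        for br in para.iter(f"{{{W_NS}}}br"):
            # A bare <w:br/> is a line break; only type="page" is a page break.
            if br.get(f"{{{W_NS}}}type") == "page":
                hits.append(index)
                break
    return hits


# Hand-written sample: the second paragraph holds a page break, and the
# third holds only a plain line break.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:t>Page one</w:t></w:r></w:p>'
    '<w:p><w:r><w:br w:type="page"/></w:r></w:p>'
    '<w:p><w:r><w:t>Page two</w:t><w:br/></w:r></w:p>'
    '</w:body></w:document>'
)

print(paragraphs_with_page_breaks(SAMPLE))
```

Once this passes on a few real documents, the same pattern can be repeated for bold text, font size, and alignment.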

kwheelan · 2024-09-09T17:55:49Z

@lucakato Could you also isolate these changes on a new, separate branch? I don't want to combine the Word changes with the date parsing fixes, etc., until they're all finished.

kwheelan added this to the Conversion improvements milestone May 28, 2024

kwheelan added the "python" label (Task to implement ACFR parsing on the backend) May 28, 2024
