Improve Word to HTML conversion #65

Open
3 tasks
kwheelan opened this issue May 28, 2024 · 3 comments
Labels
python Task to implement ACFR parsing on the backend

Comments

Collaborator

kwheelan commented May 28, 2024

  • Test a few different Word docs to see what needs to change
  • Create a more robust way to identify page numbers and page breaks
  • See if there's a way to detect larger font/bold text/center alignment without requiring the user to select Styles in the Word doc (maybe this will already work now that the text isn't in a table?)
kwheelan added this to the Conversion improvements milestone May 28, 2024
kwheelan added the python label (Task to implement ACFR parsing on the backend) May 28, 2024
Collaborator

lucakato commented Sep 5, 2024

I coded a skeleton of functions we could use for this; it's in the 83 branch. The idea is to use python-docx.

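    from docx import Document
    from docx.enum.text import WD_ALIGN_PARAGRAPH

    # Skeleton methods for the converter class; self.docx_file is the .docx path.

    def analyze_document_structure(self):
        doc = Document(self.docx_file)
        for paragraph in doc.paragraphs:
            style_font = paragraph.style.font

            # Check font size: direct run formatting first, then the paragraph style
            sizes = [run.font.size or style_font.size for run in paragraph.runs]
            if any(size is not None and size.pt > 12 for size in sizes):
                # This is larger text
                pass

            # Check bold: run.bold is None when the value is inherited
            if any(style_font.bold if run.bold is None else run.bold
                   for run in paragraph.runs):
                # This is bold text
                pass

            # Check alignment (None means it is inherited from the style)
            if paragraph.alignment == WD_ALIGN_PARAGRAPH.CENTER:
                # This is center-aligned text
                pass

            # Check for explicit page breaks (_p is python-docx's private XML element)
            if (paragraph.paragraph_format.page_break_before
                    or paragraph._p.xpath('.//w:br[@w:type="page"]')):
                # This paragraph starts on, or contains, a page break
                pass

    def identify_page_numbers(self):
        doc = Document(self.docx_file)
        for section in doc.sections:
            for paragraph in section.footer.paragraphs:
                # PAGE is a field code, so it never shows up in paragraph.text;
                # read the field instructions from the XML instead
                instructions = paragraph._p.xpath(
                    './/w:instrText/text() | .//w:fldSimple/@w:instr')
                if any('PAGE' in instr.upper().split() for instr in instructions):
                    # This section uses page numbers in the footer
                    pass

    def detect_formatting_changes(self):
        doc = Document(self.docx_file)
        previous_format = None
        for paragraph in doc.paragraphs:
            # Style-level values only; direct run formatting is not compared here
            current_format = {
                'font': paragraph.style.font.name,
                'size': paragraph.style.font.size,
                'bold': paragraph.style.font.bold,
                'italic': paragraph.style.font.italic,
                'alignment': paragraph.alignment
            }
            if current_format != previous_format:
                # Format change detected
                pass
            previous_format = current_format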

Collaborator Author

kwheelan commented Sep 9, 2024

@lucakato The idea seems solid. A few notes:

  • You won't be able to use python-docx exclusively: I think only lxml (the package we use now) can parse the document tree to extract the Word comments, which hold the XBRL tags. A hybrid approach should work, though, with python-docx identifying the formatting.
  • That means you'll have to write a function that maps the Word formatting you identify to the converted HTML (i.e., its location in the document).
  • You'll also have to write functions that add the appropriate HTML tags for the identified formatting.
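
As a rough illustration of the last two notes, here is a minimal sketch that turns one Word paragraph element into HTML while keeping track of which Word comments are anchored in it. It is not the project's converter: it uses the standard library's `xml.etree.ElementTree` in place of lxml so it runs without dependencies, the names `paragraph_to_html` and `SAMPLE_PARAGRAPH` are hypothetical, and the `data-comment-ids` attribute is an invented placeholder for however the real code links XBRL comments to the output.

```python
import html
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _toggle_on(run_props, tag):
    """True if a toggle property such as w:b is present and not switched off."""
    if run_props is None:
        return False
    element = run_props.find(f"w:{tag}", NS)
    if element is None:
        return False
    return element.get(f"{{{W}}}val", "true") not in ("0", "false")


def paragraph_to_html(p):
    """Convert one w:p element to an HTML <p>, keeping bold runs and centering."""
    parts = []
    for run in p.findall("w:r", NS):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        if not text:
            continue
        text = html.escape(text)
        if _toggle_on(run.find("w:rPr", NS), "b"):
            text = f"<strong>{text}</strong>"
        parts.append(text)

    # Direct paragraph alignment only; alignment inherited from a style is ignored.
    jc = p.find("w:pPr/w:jc", NS)
    centered = jc is not None and jc.get(f"{{{W}}}val") == "center"

    # Ids of the Word comments whose range starts inside this paragraph.
    comment_ids = [c.get(f"{{{W}}}id") for c in p.iter(f"{{{W}}}commentRangeStart")]

    attrs = ""
    if centered:
        attrs += ' style="text-align: center"'
    if comment_ids:
        joined = ",".join(comment_ids)
        attrs += f' data-comment-ids="{joined}"'  # hypothetical attribute name
    return f"<p{attrs}>{''.join(parts)}</p>"


# A hand-written paragraph: centered, one bold run, one comment anchored in it.
SAMPLE_PARAGRAPH = (
    f'<w:p xmlns:w="{W}">'
    '<w:pPr><w:jc w:val="center"/></w:pPr>'
    '<w:commentRangeStart w:id="3"/>'
    "<w:r><w:rPr><w:b/></w:rPr><w:t>Total</w:t></w:r>"
    '<w:r><w:t xml:space="preserve"> assets &amp; more</w:t></w:r>'
    '<w:commentRangeEnd w:id="3"/>'
    "</w:p>"
)

if __name__ == "__main__":
    print(paragraph_to_html(ET.fromstring(SAMPLE_PARAGRAPH)))
```

Because the formatting and the comment anchors are read from the same element in the same pass, there is no separate step to line up "where the bold text was" with "where the comment was". Whether that fits the existing lxml code is something to check against the branch.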

My advice is to start with one formatting thing (like identifying page breaks or bold text) and test that before writing out the whole logic.
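
For example, a first isolated step could be explicit page breaks only, with a tiny fixture to test against. This is a sketch under assumptions: `page_break_paragraphs` and `make_test_docx` are hypothetical names, the fixture is a zip holding only `word/document.xml` (enough for this function, not a file Word would open), and it again uses the standard library rather than lxml or python-docx.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def page_break_paragraphs(docx_file):
    """Return indexes of top-level body paragraphs containing an explicit page break."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    hits = []
    # Only direct children of w:body; paragraphs inside tables are not visited.
    for index, paragraph in enumerate(root.findall("w:body/w:p", NS)):
        breaks = paragraph.iter(f"{{{W}}}br")
        if any(br.get(f"{{{W}}}type") == "page" for br in breaks):
            hits.append(index)
    return hits


def make_test_docx(paragraphs_xml):
    """Build an in-memory zip with just word/document.xml, for tests only."""
    body = "".join(paragraphs_xml)
    document = f'<w:document xmlns:w="{W}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    fixture = make_test_docx([
        "<w:p><w:r><w:t>Page one</w:t></w:r></w:p>",
        '<w:p><w:r><w:br w:type="page"/></w:r></w:p>',
        "<w:p><w:r><w:br/><w:t>Line break only</w:t></w:r></w:p>",
    ])
    print(page_break_paragraphs(fixture))
```

A plain `w:br` without `w:type="page"` is a line break, so the third fixture paragraph is deliberately not reported. Page breaks that Word inserts on its own during layout are not covered by this check.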

Collaborator Author

kwheelan commented Sep 9, 2024

@lucakato Could you also isolate these changes on a new, separate branch? I don't want to combine the Word changes with the date parsing fixes, etc., until they're all finished.


2 participants