Improve Word to HTML conversion #65
Coded a skeleton of ideas for functions we could use to do this, in the 83 branch. The idea is to use python-docx.

    from docx import Document
    from docx.enum.text import WD_ALIGN_PARAGRAPH

    def analyze_document_structure(self):
        doc = Document(self.docx_file)
        for paragraph in doc.paragraphs:
            # Check font size (style-level only; direct formatting on runs is not inspected here)
            if paragraph.style.font.size and paragraph.style.font.size.pt > 12:
                # This is larger text
                pass
            # Check bold (again, only what the paragraph style defines)
            if paragraph.style.font.bold:
                # This is bold text
                pass
            # Check alignment
            if paragraph.alignment == WD_ALIGN_PARAGRAPH.CENTER:
                # This is center-aligned text
                pass
            # Check for page breaks
            if paragraph.style.next_paragraph_style != paragraph.style:
                # The style changes after this paragraph; this might indicate
                # a section break or page break, but it is only a heuristic
                pass

    def identify_page_numbers(self):
        doc = Document(self.docx_file)
        for section in doc.sections:
            footer = section.footer
            if footer.paragraphs:
                for paragraph in footer.paragraphs:
                    # Caveat: paragraph.text does not include field codes, so a
                    # real PAGE field may not be matched by this literal check
                    if '{PAGE}' in paragraph.text:
                        # This section uses page numbers in the footer
                        pass

    def detect_formatting_changes(self):
        doc = Document(self.docx_file)
        previous_format = None
        for paragraph in doc.paragraphs:
            current_format = {
                'font': paragraph.style.font.name,
                'size': paragraph.style.font.size,
                'bold': paragraph.style.font.bold,
                'italic': paragraph.style.font.italic,
                'alignment': paragraph.alignment
            }
            # Note: the first paragraph always differs from the initial None
            if current_format != previous_format:
                # Format change detected
                pass
            previous_format = current_format
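One possible way to make the footer page-number check more reliable is to look for the PAGE field in the footer XML itself, since Word stores page numbers as field instructions rather than as literal text. The sketch below swaps python-docx for the standard library (`zipfile` and `xml.etree.ElementTree`) and reads the footer parts directly. The helper names `has_page_field` and `footers_with_page_numbers` are hypothetical, and the sketch assumes the field instruction is not split across several `w:instrText` elements.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace used by document, header, and footer parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def has_page_field(part_xml):
    """Return True if the XML of a header/footer part contains a PAGE field."""
    root = ET.fromstring(part_xml)
    # Complex fields: the instruction text lives in <w:instrText> elements.
    for instr in root.iter(f"{{{W_NS}}}instrText"):
        # Compare whole tokens so that NUMPAGES does not count as PAGE.
        if "PAGE" in (instr.text or "").split():
            return True
    # Simple fields: the instruction is the w:instr attribute of <w:fldSimple>.
    for fld in root.iter(f"{{{W_NS}}}fldSimple"):
        if "PAGE" in fld.get(f"{{{W_NS}}}instr", "").split():
            return True
    return False


def footers_with_page_numbers(docx_path):
    """List the footer parts of a .docx file that contain a PAGE field."""
    with zipfile.ZipFile(docx_path) as archive:
        return [
            name
            for name in archive.namelist()
            if name.startswith("word/footer")
            and name.endswith(".xml")
            and has_page_field(archive.read(name))
        ]
```

Calling `footers_with_page_numbers(self.docx_file)` would return the names of the footer parts that carry a page-number field. It does not say which section uses which footer; mapping footers back to sections would still need the document relationships or python-docx.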
@lucakato The idea seems solid. A few notes:
My advice is to start with one formatting feature (such as identifying page breaks or bold text) and test that before writing out the whole logic.
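As a rough illustration of the "one feature, tested first" approach, here is a minimal sketch for bold text only. It reads bold at the run level (`paragraph.runs`, `run.bold`, `run.text` in python-docx) rather than from the paragraph style, because direct formatting is applied to runs. The function names `bold_run_texts` and `bold_text_in_file` are hypothetical, and the stand-in objects exist only so the check can be run without a real .docx file.

```python
from types import SimpleNamespace


def bold_run_texts(paragraph):
    """Return the text of every run in the paragraph that is explicitly bold.

    In python-docx, run.bold is tri-state: True, False, or None (inherit
    from the style hierarchy). Only an explicit True is counted here.
    """
    return [run.text for run in paragraph.runs if run.bold is True]


def bold_text_in_file(docx_path):
    """Collect explicitly bold run text from every body paragraph."""
    # Imported lazily so bold_run_texts can be tested without python-docx.
    from docx import Document

    doc = Document(docx_path)
    return [text for p in doc.paragraphs for text in bold_run_texts(p)]


# Stand-in paragraph mimicking the attributes used above (an assumption
# made for testing, not a real python-docx object).
fake_paragraph = SimpleNamespace(
    runs=[
        SimpleNamespace(text="Title", bold=True),
        SimpleNamespace(text=" inherits", bold=None),
        SimpleNamespace(text=" plain", bold=False),
    ]
)
assert bold_run_texts(fake_paragraph) == ["Title"]
```

Runs whose bold comes only from a style (`run.bold is None`) are deliberately skipped in this sketch; handling inherited formatting would be a separate step once this one is verified.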
@lucakato Could you also isolate these changes on a new, separate branch? I don't want to combine the Word changes with the date parsing fixes, etc., until they're all finished.