Testing report for ODT output #15

Open
gusbrs opened this issue Jul 2, 2018 · 5 comments

@gusbrs

gusbrs commented Jul 2, 2018

As you asked (or as was my misunderstanding of your request :), I did some testing for ODT output with make4ht.
My approach was to start from an actual working document of mine, with all the elements I usually employ, and reduce it to a smaller test document that retained its complexity and elements. I removed the nested tabular/makecell elements, though, because I wanted to test things with make4ht "vanilla".

Indeed, all testing was done with:

make4ht -f odt filename.tex

without any additional config or make files. And biber filename as appropriate, of course.

As for environment, tests were done with a full and up-to-date TeX Live 2018, with the current dev version of make4ht on a Linux Mint 18.3, also up-to-date.

The test files are available at: https://gist.github.com/gusbrs/36ea400945e7031096464a8f98e001b4
(Please download them and let me know when you’ve done so. As they were derived from a working document of mine, I don’t want to leave this publicly available.)

There are three files. The first one was built with the above intention in mind, and compiled and tested with pdflatex. Now, this file, as it is, is not really amenable to being built with make4ht. So I had to strip down some things to reach the second file which, like the first, is based on the scrartcl class. The third test file, in turn, is a version of the second one with the standard article class.

What had to be removed from the full document to get results with make4ht

  • Some things break compilation:
    • url inside thanks:
      • I substituted \url for \texttt
    • non ascii characters in label:
      • the label \label{sec:Introdução} and corresponding reference leads to
        errors in compilation, so it was substituted with \label{sec:Introducao}
  • Some things don’t break compilation, but do break the ODT file:
    • This one seems tricky. The use of \nocite{*} leads to problems with
      ampersands in other parts of the document (and in the bibliography as
      well). (I have used biblatex-examples.bib for the test files).
    • What happens is that, with \nocite{*} uncommented, duly escaped \& in
      TeX input elsewhere end up in content.xml as raw &, thus breaking ODT
      output.
    • So I had \nocite{*} commented. But you can reproduce the error by
      uncommenting it. You’ll see that LibreOffice will report errors at some
      ampersands in a quote environment earlier in the document.
  • Some things neither break compilation nor the ODT file, but lead to damaged
    output:
    • Letting biblatex create hyperlinks in citations issues several warnings. But,
      more importantly, the document itself suffers from several problems,
      including:
      • missing abstract
      • spurious spaces in the middle of words
      • figures with wrong sizes
      • missing parts of citations (the parts that would be hyperlinked)
      • (probably more; the final output is indeed quite damaged)
    • So I added hyperref=false to biblatex’s options.
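
The non-ASCII label problem above can be caught before running make4ht. What follows is only a minimal sketch, not part of make4ht: the function name `non_ascii_labels` and the list of reference commands it scans are assumptions made for illustration.

```python
import re

# Commands whose argument is a label key. This list is an assumption;
# extend it to match the cross-referencing packages actually in use.
LABEL_RE = re.compile(r"\\(label|ref|pageref|nameref|autoref)\{([^}]*)\}")


def non_ascii_labels(tex_source):
    """Return (line number, command, key) for each key containing non-ASCII characters."""
    hits = []
    for lineno, line in enumerate(tex_source.splitlines(), start=1):
        for match in LABEL_RE.finditer(line):
            command, key = match.group(1), match.group(2)
            if not key.isascii():
                hits.append((lineno, command, key))
    return hits


sample = "\\section{Introdução}\\label{sec:Introdução}\nSee \\nameref{sec:Introducao}.\n"
print(non_ascii_labels(sample))
```

Only the key inside the label and reference commands is inspected, so accented section titles are left alone.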

With these changes, we have the second test file, which is compilable and produces reasonable (though improvable) output.

Log files (full piped terminal output) for both the second and third test files are available at: https://gist.github.com/gusbrs/f822630ffd09029871401fe54c3746a2

Comments on the second (scrartcl) ODT output

  • if title spans two lines in pdf output make4ht gobbles space between the
    lines ("toa" instead of "to a")
  • footnotemark placement of \thanks is not appropriate.
  • abstract environment doesn’t seem to be recognized
  • footnotemark of regular footnotes receives spurious space (space between
    punctuation and footnotemark)
  • footnotes’ text is not justified
  • \clearpage is not respected (I haven’t forgotten
    https://tex.stackexchange.com/q/435235/105447, of course. But, as you
    mentioned there that that solution breaks other things, I report it here as a
    standing issue)
  • small caps are converted to regular capitals
  • hyperlink of \nameref is placed after the content (and introduces spurious
    space in the process) ("Introdução_,")
  • quoting environment simply vanishes from output (following paragraph is
    gobbled in the process)
  • displayquote environments and variants are recognized as regular paragraphs
    (true, csquotes is configured to use quoting environments); in the process,
    paragraph breaks (empty lines) are gobbled
  • linespacing of quotation environment differs from that of quote environment
  • as far as I can tell, the use of \ after an abbreviation point to avoid
    extra "end of sentence spacing" with frenchspacing is turned into a
    non-breaking space in ODT
  • I use a custom description environment to add notes to floats, named
    floatnotes. The environment appears at the end of figure environment, but
    line breaks within floatnotes and between it and the caption are
    gobbled. The entire floatnotes environment vanishes on table floats.
  • some tables receive extra lines (my guess so far is that this is some
    interaction with the use of multicolumn)
  • table borders (here done with booktabs) do not reach ODT output
  • in description environments a paragraph break is introduced between label
    and text
  • the conversion somehow tricks LibreOffice current language
    recognition. Wherever you place your cursor in the resulting ODT (no active
    selection), the status bar will show "multiple languages"
  • indentation in quotation environment and hanging indent in bibliography are
    very large
  • quotation environment is not justified
  • it appears hyphenation is disabled in the resulting ODT
  • quotation and quote seem to be rendered in a frame/box (I don’t know what
    it is, nor if it is desirable. But I can’t seem to be able to delete it in the
    resulting ODT.)

Comments on the third (article) ODT output

Here some things seem to work better:

  • abstract environment is recognized
  • the spurious space before footnotemark is no longer there

But pretty much everything else stands on the same ground.

Comments on the third (article) resulting content.xml

  • regular paragraphs of text come out reasonably clean, but other kinds of
    environments/paragraph styles come full of text:span environments (I won’t
    say this is an "issue", but it would be nice to have a cleaner
    content.xml. If it is possible for regular paragraphs, why not for the
    rest?)
  • Emacs XML mode reports content.xml as "invalid". LibreOffice seems to be OK
    with it (well, it opens the file but, as the confusion with the current
    language shows, probably not everything is OK) and I don’t know if Emacs would
    be an authoritative source on the matter, but some consistency check on
    content.xml might be welcome.
  • The gobbled content (as assessed in the ODT) is somehow in content.xml
    (including the quoting environment and the missing floatnotes
    environments), which suggests this is a consistency problem in
    content.xml. My guess though is that gobbled line and paragraph breaks are
    gone for good (but those are, of course, much less important).
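
A basic consistency check on content.xml can be scripted. The sketch below only tests XML well-formedness, which is a weaker condition than validity against the ODF schema (and may not be what Emacs was reporting). The function name `check_content_xml` is hypothetical, and the file name in the usage comment assumes the output of the make4ht command used in this report.

```python
import zipfile
import xml.etree.ElementTree as ET


def check_content_xml(odt_file):
    """Return None if content.xml is well-formed, else (line, column, message).

    `odt_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(odt_file) as archive:
        data = archive.read("content.xml")
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line, column = err.position
        return (line, column, str(err))
    return None


# Usage (assumes filename.odt was produced by `make4ht -f odt filename.tex`):
# print(check_content_xml("filename.odt"))
```

A raw & in content.xml, as in the \nocite{*} case above, would be reported here with its line and column.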

Well, I hope this testing is useful. Thank you for the great work! And, as usual, I remain at your disposal for discussion and further testing.

@michal-h21
Owner

Thanks, that is quite a massive report :o

I will need some time to process it, some issues may be quite hard to fix.

@michal-h21
Owner

My first findings:

  • issue with \url inside \thanks: it is due to the fact that tex4ht processes \title and \author commands using \edef, and \url isn't expandable. I am afraid that we cannot fix it, but it is possible to use

    \noexpand\url{https://my.site.com/}
    

as a workaround

  • accented characters in labels: it is best to avoid them; only ASCII characters are safe. It is possible to support them using a Unicode engine. For example

    make4ht -ul -f odt geopoltest1.tex

seems to work and fixes next issue:

  • \protect in https:\protect/my.site.com/ - this seems to be inserted by Brazilian definitions for Babel, LuaTeX fixes that and the URL is correct.

  • regarding ampersands in the XML file, this was a bug in make4ht: the filter that converts XML entities back to Unicode didn't take into account forbidden characters that break XML validity. I've updated make4ht and it should work now.

  • there is an issue in the bibliography: one record uses a URL in the form of https://doi.org/10.1002/(SICI)1096-987X(199803)19:4<377::AID-JCC1>3.0.CO;2-P so it contains < and > characters, which break XML. I think that it is a bug in the bib file; all URLs should be in a safe form, like https://doi.org/10.1002%2F%28SICI%291096-987X%28199803%2919%3A4%3C377%3A%3AAID-JCC1%3E3.0.CO%3B2-P

  • there are a lot of issues with footnotes: all of them are doubled. This seems like an issue with KOMA-Script; it is OK with standard classes or in HTML. So I will need to investigate it more.

  • the eaten spaces are weird, this seems like a bug in the DVI processing, the space in the title is correct if I remove the \vspace command.
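
The two XML-safety points above (raw ampersands, and < and > inside the DOI URL) can be illustrated with the Python standard library. This is only a sketch of the general technique, not the code make4ht uses; the sample string passed to `escape` is made up for illustration.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

doi = "10.1002/(SICI)1096-987X(199803)19:4<377::AID-JCC1>3.0.CO;2-P"

# Percent-encode every character except letters, digits and "-", ".", "_", "~".
safe_url = "https://doi.org/" + quote(doi, safe="")
print(safe_url)

# Escape the XML-special characters &, < and > in text destined for an XML file.
escaped = escape("Tom & Jerry <1998>")
print(escaped)
```

With `safe=""` the slash inside the DOI is encoded as well, which reproduces the safe form quoted in the bullet above.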

I will try to fix these and other issues later.

@michal-h21
Owner

I've also found two entries in biblatex-examples.bib which cause invalid XML: knuth:ct:related and knuth:ct:a. The ODT file can be opened after I removed them. This is definitely a bug in tex4ht.

@gusbrs
Author

gusbrs commented Jul 4, 2018

@michal-h21 Nice to see things going that fast. Thank you very much! I'll be following attentively your comments here and, if need be, will comment back (So far, I have nothing to add to your observations). And, if you reach a point where you want me to test things again, just let me know.

@michal-h21
Owner

Today I've fixed some issues in the tex4ht sources, in a quest to make the resulting ODT file valid in the ODF validator. I've removed some DTD definitions that didn't really work. There are still some validation issues with math, but I think I am on a good path.

One huge success is that Word can now open the ODT file and display math, which it didn't support up until now. The issue was only a wrong MIME type in the file directory. It is really good that it is no longer necessary to fix the ODT file in LibreOffice.
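
To inspect which MIME types an ODT file declares, one can list the entries of its manifest. The sketch below assumes that the "file directory" mentioned above is META-INF/manifest.xml, which the comment does not confirm. The function name `manifest_media_types` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the ODF package manifest.
MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"


def manifest_media_types(odt_file):
    """Map each path listed in META-INF/manifest.xml to its declared media type."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("META-INF/manifest.xml"))
    entries = {}
    for entry in root.iter(f"{{{MANIFEST_NS}}}file-entry"):
        path = entry.get(f"{{{MANIFEST_NS}}}full-path")
        entries[path] = entry.get(f"{{{MANIFEST_NS}}}media-type")
    return entries


# Usage (the file name is an assumption):
# for path, media_type in manifest_media_types("filename.odt").items():
#     print(path, media_type)
```

Comparing this listing for a file that Word opens against one it rejects is one way to spot a mismatched entry.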

On the negative side, pandoc cannot convert the ODT file, even though it is perfectly valid. It reports only:

Couldn't parse odt file.

This needs further investigation.

The bad thing is that with every fix I find more bugs, so there is still a lot to do.
