Rough draft of LaTeX PDF version of the manual

urklang · May 27, 2019, 11:56pm

I made a rough draft PDF version of the latest manual by converting from HTML to LaTeX with Pandoc.

There are some of the usual issues you would expect from an automated conversion: tables are a little wonky, there are broken characters/glyphs here and there. I’m pleasantly surprised with the result for how little time it took though. Images are mostly working. Links are mostly working as well.

If there’s interest, I’ll work on some of the rough edges and try to get it into the manual repo so we can have a generated PDF version in the future.

Headwar · May 28, 2019, 8:20am

Hi,

First, I must say your version looks definitely better than those I’ve had with e.g. wkhtmltopdf or any of those html2pdf tools, so kudos to you!

I am also working towards generating a PDF automatically each time the doc is updated (hence the 1-page HTML version of the manual that you probably used to create this document, and the CSS simplification of the last few months). However, I’ve come to realize that what remained to be fixed couldn’t be, as HTML is definitely not a “physical document” format. There are a few CSS tricks, like paged media, that would probably help, but we are a long way from getting a serious-looking PDF from what we have. Handling tables, floated screencaps and customized styles, as well as having a proper PDF hierarchy, are all going to be a lot of work, if even feasible. We could migrate the whole documentation from the custom Python scripts we are using now to e.g. Sphinx (which produces nice PDFs and has an integrated search), but that needs converting the whole documentation to reStructuredText, and would need approval from the whole team.

What would be left for you to do is a lot of work on all the HTML/LaTeX to have a nice PDF, and this document (and all your work) would be obsoleted by the first doc update, unless you find a way to automate the nice-PDF generation.

Regards, and again, congrats on the doc even as-is, don’t let me discourage you!
-Edouard

urklang · May 28, 2019, 10:28am

You’re exactly right that I just used the 1-page HTML version as the input. From there it’s basically a one-liner with Pandoc. There was really only one place that had to be adjusted manually: a nested HTML table. Pandoc chokes on that. So basically the generated PDF could track the HTML changes in an automated fashion; I was thinking of just adding a build step. We could have a simple filter at the LaTeX level to resolve any major issues or to apply stylistic changes, etc. Pandoc has the capability of converting to various other formats as well. There might be some useful intermediate format we could use as part of any filtering before rendering out to LaTeX.
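
For illustration, the build step could be a thin wrapper around that one-liner. This is only a sketch: the file names `manual.html` and `manual.tex` are hypothetical, not paths from the manual repo, and the exact options used for the draft above are not stated in the thread.

```python
import subprocess


def pandoc_command(html_path, tex_path):
    """Return the argument list for one HTML-to-LaTeX Pandoc run."""
    return [
        "pandoc",
        "--from=html",
        "--to=latex",
        "--standalone",  # emit a full document, preamble included
        "--output", str(tex_path),
        str(html_path),
    ]


def build_manual_tex(html_path, tex_path):
    """Run Pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(html_path, tex_path), check=True)
```

Calling `build_manual_tex("manual.html", "manual.tex")` from the existing build scripts would regenerate the LaTeX whenever the single-page HTML changes, provided Pandoc is installed on the build machine.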

Of course as you mention there’s probably a more general solution where all the content is in some format that could be rendered consistently both to HTML and to PDF. Sphinx looks like an interesting possibility.

_FrnchFrgg · May 28, 2019, 8:05pm

The first problem I can notice is that Ardour’s single page manual uses <h1> tags for both sections (1. xxxx) and subsections (1.1. xxxxx). This means Pandoc will generate \sections for both of them even though they are not on the same level.
Also, Pandoc seems to ignore the fact that LaTεX \sections are auto-numbered by default, so you get both the auto-numbering and the numbering that is already in the HTML (it could be auto-generated by CSS I think, but it is not the chosen solution).
Not very hard to neuter the section numbering, but still.
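
As a rough sketch of fixing the hierarchy before Pandoc sees it, the heading level can be derived from the number at the start of each `<h1>`. The regular expression assumes headings look like `<h1>1.1. xxxxx</h1>` as described above; the real single-page HTML may differ, so treat the pattern as an assumption to check.

```python
import re

# An <h1> whose text starts with a section number such as "1." or "1.1."
NUMBERED_H1 = re.compile(
    r"<h1(?P<attrs>[^>]*)>(?P<lead>\s*)(?P<num>\d+(?:\.\d+)*)"
    r"(?P<rest>\.?\s.*?)</h1>",
    re.DOTALL,
)


def _retag(match):
    # "1" -> level 1, "1.1" -> level 2, and so on (HTML stops at h6).
    level = min(match.group("num").count(".") + 1, 6)
    return "<h{0}{1}>{2}{3}{4}</h{0}>".format(
        level,
        match.group("attrs"),
        match.group("lead"),
        match.group("num"),
        match.group("rest"),
    )


def fix_heading_levels(html):
    """Demote numbered <h1> headings to the level their number implies."""
    return NUMBERED_H1.sub(_retag, html)
```

Unnumbered headings are left untouched, and the number itself stays in the text; removing it is a separate step.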
All in all, Pandoc’s LaTεX output is not very good: it produces pleasing results, but is mostly intended for intermediary format to convert to PDF. The produced LaTεX is full of manual tweaks, cheap tricks and other stuff instead of being a semantic translation of the HTML, which could then be formatted correctly by the TεX engine.
As soon as I tried using one of my own LaTεX classes, which has better typography and layout than the default ones, a lot of things of course became difficult to handle (like the fact that it translates HTML tables to l tabular columns, even for multi-paragraph cell contents).

One of my pet projects is to write a package in TeX that does the complete HTML/CSS layout of tables. I have implemented the cellprops package, which can apply paddings, borders, backgrounds and min/max heights to tabular cells, but it still doesn’t break free from the LaTεX model where either you have a single-line auto-width (x)or a multi-line fixed-width cell. Until TεX has full HTML table layout, Pandoc translations of such tables will poorly fit into the TεX world.

urklang · May 28, 2019, 9:50pm

That’s an interesting point about Pandoc’s internals. It does feel like there is just a fundamental mismatch between HTML/CSS and TeX and that you’ll always have compromises trying to cleanly move between them without having planned for it ahead of time when writing the content. Honestly, I’m surprised that Pandoc’s output is as good as it is, but I’m not as attuned to the TeX-world conventions.

One approach might be to simply not expect to use tables on the TeX side for every HTML table. At the bottom of http://manual.ardour.org/ardours-interface/selection-and-punch-clocks/ for example there’s a table with basically a paragraph of text in the right-hand cell. This seems to happen a lot in the HTML manual. Pandoc doesn’t handle this situation well at all and allows the table text to flow out of the right margin.
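
One way to picture “not a table on the TeX side” is to render such two-column rows as a LaTeX description list, which wraps long text naturally. The sketch below is an assumption about how that could look, not something Pandoc does; the function name is made up for illustration and the cell text is assumed to be LaTeX-safe already.

```python
def rows_to_description(rows):
    """Render (term, explanation) pairs as a LaTeX description list.

    No escaping is done here: cell text must already be valid LaTeX.
    """
    lines = [r"\begin{description}"]
    for term, explanation in rows:
        # Braces around the term protect any "]" it may contain.
        lines.append(r"\item[{%s}] %s" % (term, explanation))
    lines.append(r"\end{description}")
    return "\n".join(lines)
```

Each row becomes one `\item`, so a paragraph in the right-hand cell simply flows as body text instead of running past the margin.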

seablade · May 28, 2019, 10:15pm

Not sure how true this is; or rather, it may be true only of a default conversion of source that isn’t what Pandoc expects. But Pandoc itself is VERY flexible in what it can do, including running filters IIRC before and after translations so that sources can be cleaned up to create good input, resulting in decent output. Then again I usually format from markdown to PDF or HTML in Pandoc more than anything else.

Not sure what you expect to have happen here though, it has no way of knowing that

<h1> ## TEXT HERE </h1>

Should be translated to

\section{TEXT HERE}

Without the ## number that is part of the content. That is exactly why filters need to be used here. I used to do that myself before feeding to Pandoc by using bash pipes and sed IIRC, though I believe Pandoc has the ability to create custom filters now and call them from Pandoc’s command line. Though truthfully I haven’t needed to do any of that in some time, as the options in the header of markdown text have provided a lot of possibilities, so my knowledge isn’t as up to date as it could be there.
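
A filter of that kind can be sketched with nothing but the Python standard library, since Pandoc hands a filter its document as JSON on stdin and reads the modified JSON back from stdout. The sketch assumes the number arrives as the first word of the heading and ends with a dot (as in “1.1. xxxxx”); the function names are hypothetical, and only top-level blocks are visited.

```python
import json
import re
import sys

# "1." or "1.1." and so on: digits separated by dots, ending with a dot.
NUMBER_PREFIX = re.compile(r"^\d+(\.\d+)*\.$")


def strip_heading_number(block):
    """Drop a leading number token (and the space after it) from a Header."""
    if block.get("t") != "Header":
        return block
    level, attr, inlines = block["c"]
    first = inlines[0] if inlines else {}
    if first.get("t") == "Str" and NUMBER_PREFIX.match(first["c"]):
        inlines = inlines[1:]
        if inlines and inlines[0].get("t") == "Space":
            inlines = inlines[1:]
    return {"t": "Header", "c": [level, attr, inlines]}


def filter_document(doc):
    """Apply the fix to top-level blocks; nested headers are not visited."""
    doc["blocks"] = [strip_heading_number(b) for b in doc["blocks"]]
    return doc


def main():
    """Entry point when run as a Pandoc JSON filter."""
    json.dump(filter_document(json.load(sys.stdin)), sys.stdout)
```

Saved as a script that ends with a call to `main()`, this could be passed to Pandoc with `--filter`. A heading that legitimately starts with something like “3.” would also lose it, so the pattern may need tightening for real content.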

So in short, Pandoc is a great Swiss Army knife, but understand you are going to have to tweak it to make it look good in all outputs; it just allows you to do a lot of this tweaking easily. However, it is probably going to be best to start from a single format and generate all outputs from it. Most people do use Markdown for this, but if something else like HTML or LaTeX is the starting point, then you need to be aware of the limitations and write your content for this purpose from the start, or else you are going to be using filters to turn the content into good input before feeding it into Pandoc.

    Seablade

_FrnchFrgg · May 29, 2019, 6:57am

That’s exactly what I call “LaTεX markup whose goal is to produce a PDF result instead of being reusable directly”. The choice is not bad per se, but by default Pandoc produces a “non-standalone” version, which is nothing more than what would be between \begin{document} and \end{document} in the “standalone” version, where they make sectioning commands unnumbered always (by setting the secnumdepth counter IIRC).

If you look at the Pandoc-generated preamble, you’ll understand what I mean by “full of tweaks”. At least it was like that last I looked.

I know I can change the results (even in TεX itself), but some things cannot be done meaningfully without changing the source (like the duplicate h1 level in the hierarchy)

The table translation is harder to solve in a generic manner. It might depend on the table contents, and thus would be hard to automate.

seablade · May 29, 2019, 1:52pm

Well yes, non-standalone is the default, but that is a simple matter of adding ‘-s’.

You realize that you can override that, right? The default template is one thing, but you can provide your own template just as easily as a file that you tell Pandoc to use. This is how I customize the LaTeX output to create essentially stationery/letterhead on my outputted documents for my business, or change the appearance for research reports (also for my business, technically) as needed.

Most of those tweaks are often embedded within IF statements that would normally be determined by the YAML preamble of markdown docs, or specified on the commandline.

But seriously, the entire preamble is just a template, and you can specify a completely different template easily, including telling it not to number sections at all if that is your issue. I can tell you though that most of their tweaks in the default have a reason, even if it isn’t obvious at first. And yes, the more common output is probably PDF instead of LaTeX itself, though LaTeX is used as an intermediary in there. Personally I am not sure I would ever consider LaTeX completely readable as is anyway, at least compared to Markdown.

      Seablade

_FrnchFrgg · May 29, 2019, 4:10pm

No, what I meant is that the output produced by Pandoc in the document relies on some choices that are made in the default template, and that they are harder to circumvent sometimes (like preventing the use of longtables for instance). I’m sure it is possible to do, since one has access to the internal AST of Pandoc anyway, but that’s not super fun either.

I like Pandoc, by the way, and use it regularly. But to me, the LaTεX exporter is perfectible, especially when converting MathML from the importer used for LibreOffice or MSOffice (the actual problems are shared between the MathML importer and the TεX exporter).

seablade · May 29, 2019, 5:02pm

Ahh I understand your issue now.

The problem is that there are so many ways to accomplish things sometimes in LaTeX that I doubt you could ever pick one option and everyone be happy with it. The use of longtable for instance, some people will want it, some people would want a different solution, and no matter what you do not everyone will be happy. I can find similar examples of this throughout product development in general I suppose.

Yes if you want to remove the use of longtable it is more difficult, requiring writing a new output from Pandoc’s internal language to LaTeX or writing a preamble template with a lot of hacks (def’ing the commands used from longtable to the appropriate substitute etc.). But I am not convinced that this by itself should disqualify Pandoc from much of anything honestly, it is just that you are going to run into these issues or similar no matter what solution you choose when converting between formats I think.

In the end the best results will come from creating the source with the goals in mind of multiple formats I think.

     Seablade

urklang · May 30, 2019, 1:20am

This feels like a “perfect being the enemy of the good” type of situation to me. Just having a PDF, any PDF, that is more or less readable is better than having nothing, especially one that can be generated automatically from what we have already. Hence hacking around with Pandoc as a quick experiment. Having the cleanest possible or even readable LaTeX output is not critical for me, nor is “correct” usage of LaTeX features or agreement with LaTeX conventions. Ideally, the content writer could write in the most generic possible way without having to think much about the rendering details (e.g. someone can edit Wikipedia at a basic level without knowing about PHP/MediaWiki/HTML/CSS). That’s a possible way forward actually: to make the manual editable wiki-style rather than requiring someone to clone the repo.

I think we’re all basically in agreement though that converting from complex HTML/CSS to LaTeX directly is never going to be completely optimal; we’ll need to extract the content and have different rendering backends. Which again is probably already mostly solved if we choose the right authoring format / toolchain.

BTW check this out: https://github.com/jgm/pandoc/wiki/Pandoc-Filters

_FrnchFrgg · May 31, 2019, 9:32am

I agree. But currently the way the tables are exported makes a lot of them unreadable (with lines that go beyond the right edge of the page). It can be changed by hacking a new longtable environment that tries to be smart about it, or making Pandoc use another command. Without some markup in the actual source to tell if this is a long column or not, being “smart” about it can be hard to do: look at the code in web browsers to implement table layout.
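
To make the “smart” guess concrete, here is a deliberately crude heuristic: any column whose longest cell exceeds a character threshold becomes a wrapping `p{...}` column, and everything else stays `l`. The threshold of 40 characters and the 60% share of the line width are arbitrary assumptions for illustration, nowhere near what a browser’s table layout does.

```python
def column_spec(rows, wrap_threshold=40, wrapped_share=0.6):
    """Guess a LaTeX column spec from the plain-text cells of a table.

    rows is a list of equal-length sequences of cell strings.
    """
    # Longest cell, in characters, for each column.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    wrapped = [width > wrap_threshold for width in widths]
    n_wrapped = sum(wrapped)
    if n_wrapped == 0:
        return "l" * len(widths)
    # Split the chosen share of the line evenly between the long columns.
    fraction = wrapped_share / n_wrapped
    return "".join(
        r"p{%.2f\linewidth}" % fraction if is_long else "l"
        for is_long in wrapped
    )
```

Character counts ignore fonts, markup and images inside cells, which is exactly why explicit markup in the source would be more reliable than guessing.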

x42 · May 31, 2019, 3:22pm

What is the eventual goal here? Create a printable book of the reference manual, or provide a convenient, searchable document for offline reading?

urklang · June 1, 2019, 1:13am

For me personally, having a printable book would be a secondary goal. Being searchable, legible and available offline are primary.

x42 · June 1, 2019, 11:44am

I suppose an offline HTML version of the manual could also provide that.

urklang · June 1, 2019, 1:37pm

Definitely. I think the HTML manual is well done. It works well for progressing through the material tutorial-style. But I like the feel of having a single linear document to rapidly look something up, especially when there is as much content as we have. I find myself clicking through the navigation a lot trying to find a particular section or jumping out to Google.