Add Pandoc Meta data to Context #643

Open
flupe opened this issue Jun 15, 2018 · 18 comments · May be fixed by #780
@flupe

flupe commented Jun 15, 2018

One interesting feature of ReST is that you can define document variables and meta data directly inside the document, without having to rely on some additional YAML header.

For example, the title of the document and its eventual subtitle can be inferred from the first headers of the file (see Markup Specification (Document Structure)),
and if the first non-comment element is a field list, its fields update the bibliographic information of the document (see Markup Specification (Bibliographic Fields)).

Although the ReST Parser of Pandoc is far from perfect (it does not support custom directives or roles, and the Pandoc AST is quite restrictive), it does implement the aforementioned features, in standalone mode, and populates the Meta information of the Pandoc document.

Document Title
==============

Subtitle
--------

:author: flupe

However, it seems as though the Pandoc compilers provided by Hakyll completely ignore the meta information of the parsed Pandoc documents.

I don't really know if other markup languages supported by Pandoc also populate the meta information, but I do think it would be useful to provide an easier way to inject this meta information into Hakyll contexts.


For a custom site I've just set up, it is somehow working a little. Here is the relevant part (source):

 match "posts/*" $ do
    route $ setExtension "html"

    let ropts = defaultHakyllReaderOptions { readerStandalone = True }
        wopts = defaultHakyllWriterOptions

    compile $ do
        document <- getResourceBody >>= readPandocWith ropts

        let
            Pandoc meta _ = itemBody document

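            inlinesToString :: [Inline] -> String
            inlinesToString inlines =
                concatMap inlineToString inlines
                where
                    inlineToString (Str a) = a
                    inlineToString Space   = " "
                    -- Ignore other inlines (emphasis, links, ...) instead of
                    -- crashing on a non-exhaustive pattern match.
                    inlineToString _       = ""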

            extractMeta :: String -> MetaValue -> Context a
            extractMeta name metavalue =
                case metavalue of
                    MetaInlines inlines -> mkField $ inlinesToString inlines
                    _ -> mempty
                where
                    mkField = field name . const . return

            ctx :: Context String
            ctx = foldMapWithKey extractMeta (unMeta meta)
                <> postCtx

        writePandocWith wopts document
            &   loadAndApplyTemplate "templates/post.html" ctx
            >>= loadAndApplyTemplate "templates/default.html" ctx
            >>= relativizeUrls

Essentially, we:

  1. Parse the post and get an Item Pandoc
  2. Create a new context ctx populated with the metadata from the document.
  3. Render the post with templates using the new context we created.

However, this mechanism in ReST implies that you actually need to parse the document to get the context, and I have no idea how to make this work well with other rules such as creating an index.html or archives.html page (without having to parse again for each rule).


In conclusion, here's what I am suggesting:

  • adding a proper way to transform Pandoc Meta into a Hakyll Context.
  • a way to associate this information and posts inside snapshots?
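
A minimal sketch of what the first suggestion could look like, under stated assumptions: the `Inline` and `MetaValue` types below are simplified local stand-ins that mirror a subset of `Text.Pandoc.Definition` so the snippet compiles without Pandoc installed, and `metaToText` and `metaFields` are hypothetical helper names, not part of Hakyll or Pandoc.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as Map
import           Data.Text (Text)
import qualified Data.Text as T

-- Local stand-ins mirroring a subset of Text.Pandoc.Definition.
-- In a real site, import the real types instead.
data Inline = Str Text | Space | Emph [Inline]
  deriving (Eq, Show)

data MetaValue
  = MetaString Text
  | MetaBool Bool
  | MetaInlines [Inline]
  | MetaList [MetaValue]
  deriving (Eq, Show)

-- Flatten inlines to plain text; total over the stand-in Inline type.
stringify :: [Inline] -> Text
stringify = T.concat . map go
  where
    go (Str t)   = t
    go Space     = " "
    go (Emph xs) = stringify xs

-- Keep only the values that have an obvious plain-text rendering.
metaToText :: MetaValue -> Maybe Text
metaToText (MetaString t)   = Just t
metaToText (MetaBool b)     = Just (if b then "true" else "false")
metaToText (MetaInlines xs) = Just (stringify xs)
metaToText (MetaList _)     = Nothing

-- Hypothetical helper: key/value pairs ready to become template fields.
metaFields :: Map.Map Text MetaValue -> [(Text, Text)]
metaFields = Map.toList . Map.mapMaybe metaToText
```

Each resulting pair could then be turned into a template field with Hakyll's `constField` (after `T.unpack`) and combined with `<>`. This does not address the second suggestion, since the document still has to be parsed before the pairs exist.
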
@LaurentRDC
Collaborator

+1.

In my case, I have a Pandoc filter that counts the number of words in blog posts and deduces the time-to-read. This information is then stored in the Pandoc metadata. I would like to tap into this information via a template.

I'm willing to draft a pull request if someone can help me understand the steps required.

@ip1981

ip1981 commented Jun 3, 2020

I have an idea which I'm going to try: extend the Provider type by adding a new field - providerMetadata :: FilePath -> IO Metadata (or something like this). That way I will be able to use any routine to extract any metadata, which can then be merged with the existing metadata.

ip1981 linked a pull request Jun 4, 2020 that will close this issue
@flupe
Author

flupe commented Jun 4, 2020

Thank you for your work!

However I don't think it is the right solution for this specific issue just yet.
The thing is, because the document metadata is accessible from pandoc once the entire document has been parsed, you would ideally parse the file once, and make hakyll use both the document metadata and the parsed content from there on.
With your current solution, while it is true that you can specify a custom metadata provider, it still relies on parsing the document from scratch again and again. That is very slow.

A more appropriate solution would be to improve how files are loaded into the store (See load here: https://hackage.haskell.org/package/hakyll-4.13.3.0/docs/src/Hakyll.Core.Provider.MetadataCache.html#resourceMetadata) so that for pandoc documents, we can retrieve there the metadata and store some Item Pandoc instead of Item String (or whatever the internal pandoc representation type is called).

Given the lack of response from hakyll's maintainer @jaspervdj , I did not start working on this as I had very little hope for such a change to be merged, and thought that designing a proper solution required some more discussion. Might look at it in the future if I can find some time. Please tell me if I got your PR wrong.

@ip1981

ip1981 commented Jun 4, 2020

Meh, Pandoc metadata is not trivial (I personally do not want to lose title formatting - from LaTeX), someone may use Hakyll without Pandoc, etc. After all I want a universal solution. Thus just a simple FilePath -> IO Metadata. Additionally, one may want to build metadata from the resource body and not rely on Pandoc's metadata.

@flupe
Author

flupe commented Jun 4, 2020

Let me reformulate. I have no doubt your PR is useful and would love to see it merged. What I'm arguing is that it does not resolve this issue, hence I'd rather you did not put "Closes #643" in the PR comment.

As for providing only a generic solution and not an additional one for pandoc documents, I don't believe the argument that "someone may use Hakyll without Pandoc" to be sufficient. Hakyll is very much made to work well with pandoc (pandoc compilers would not have been included otherwise), and I think there is value to optionally ease the handling of pandoc metadata. The upvotes this issue received suggest other people are interested as well. The performance concern due to parsing at least twice every document still stands, and if we can do better by being less generic then so be it.

@flupe
Author

flupe commented Jun 15, 2020

Closing this now. After @ip1981's comment and PR more than a week ago I started investigating whether it could be used as a starting point for solving this issue. Still, I was just trying to work against every abstraction Hakyll is using.

All in all Hakyll was simply not a generic enough tool for what I wanted (that's not bad per se!).
Ended up making my own library which you can hear more about here. It's very tiny and does everything I want. Closing the issue.

flupe closed this as completed Jun 15, 2020
@gnull

gnull commented Jul 31, 2020

Still, I was just trying to work against every abstraction Hakyll is using.

@flupe, do you mean that you tried implementing that solution you suggested, which would make Hakyll work with an Item Pandoc instead of an Item String?

@gwern
Contributor

gwern commented Aug 29, 2022

I think this issue should be unclosed. I just ran into a similar problem where I assumed that the Pandoc ASTs being passed into my transformation pipeline would have their YAML metadata, because why wouldn't they? I coded up a complete solution to my problem before running it and discovering that no, Hakyll strips all of the Pandoc metadata (why???) and I had to come up with an entirely different Hakyll approach. Saying that there is some other non-Hakyll library which does it differently is in no way a solution which closes this problem! (By that logic, you could close every Hakyll issue because there is presumably at least one tool out there which in some way doesn't have that issue...)

@Minoru
Collaborator

Minoru commented Aug 29, 2022

@gwern Can you post a summary of your instance of the problem, so I can poke around and understand what pieces are involved? Then we can discuss possible solutions.

Minoru reopened this Aug 29, 2022
Minoru added the feature label Aug 29, 2022
@gwern
Contributor

gwern commented Aug 29, 2022

gwern/gwern.net@9542a9a

It's fairly straightforward: I use a pandocTransformWith to run a bunch of Pandoc API transformations; for 'index' pages (pages which have index: true set in the YAML metadata eg neural net video generation bibliography), they are 'simpler' and the HTML template disables a bunch of stuff, and I thought I would disable several of the transformations as well because they are slow & cause some bugs. (This obviously can't be done at the final templating pass, because the HTML template is generated long after all of this has run; it has to be done inside the previous Compiler stage, in Hakyll-land.) So, since index: true is available in the Pandoc document type Pandoc as stored in the Data.Map.Map bundled with the actual [Blocks], and you just extract it with unMeta (p::Pandoc) and then look up the boolean variable, I thought I'd simply augment my transform pipeline with a quick lookup of index and then toggle the expensive transforms based on that. This is logical, typechecks, and runs perfectly. It's just that Hakyll erases the Pandoc metadata and you wind up with a metadata of just [], with all the original YAML values erased including index, and so the index-check never gets set to True and the expensive passes always run...

My preferred solution would be for Hakyll to simply not erase the original Pandoc metadata. Does anyone expect it to do that? You'd expect it to read a Pandoc as specified from the files specified, and it to not molest the Pandoc to erase the metadata or whatever is going on behind the scenes there.


A quick reminder of the relevant Pandoc types:

ghci> :i Pandoc
type Pandoc :: *
data Pandoc = Pandoc !Meta ![Block]
  	-- Defined in ‘Text.Pandoc.Definition’
instance Eq Pandoc -- Defined in ‘Text.Pandoc.Definition’
instance Monoid Pandoc -- Defined in ‘Text.Pandoc.Definition’
instance Ord Pandoc -- Defined in ‘Text.Pandoc.Definition’
instance Semigroup Pandoc -- Defined in ‘Text.Pandoc.Definition’
instance Show Pandoc -- Defined in ‘Text.Pandoc.Definition’
instance Read Pandoc -- Defined in ‘Text.Pandoc.Definition’
ghci> :i Meta
type Meta :: *
newtype Meta
  = Meta {unMeta :: Data.Map.Internal.Map
                      Data.Text.Internal.Text MetaValue}
  	-- Defined in ‘Text.Pandoc.Definition’
instance Eq Meta -- Defined in ‘Text.Pandoc.Definition’
instance Monoid Meta -- Defined in ‘Text.Pandoc.Definition’
instance Ord Meta -- Defined in ‘Text.Pandoc.Definition’
instance Semigroup Meta -- Defined in ‘Text.Pandoc.Definition’
instance Show Meta -- Defined in ‘Text.Pandoc.Definition’
instance Read Meta -- Defined in ‘Text.Pandoc.Definition’
ghci> :i MetaValue
type MetaValue :: *
data MetaValue
  = MetaMap !(Data.Map.Internal.Map
                Data.Text.Internal.Text MetaValue)
  | MetaList ![MetaValue]
  | MetaBool !Bool
  | MetaString !Data.Text.Internal.Text
  | MetaInlines ![Inline]
  | MetaBlocks ![Block]
  	-- Defined in ‘Text.Pandoc.Definition’
instance Eq MetaValue -- Defined in ‘Text.Pandoc.Definition’
instance Ord MetaValue -- Defined in ‘Text.Pandoc.Definition’
instance Show MetaValue -- Defined in ‘Text.Pandoc.Definition’
instance Read MetaValue -- Defined in ‘Text.Pandoc.Definition’
ghci> :i Block
type Block :: *
data Block
  = Plain ![Inline]
  | Para ![Inline]
  | LineBlock ![[Inline]]
  | CodeBlock !Attr !Data.Text.Internal.Text
  | RawBlock !Format !Data.Text.Internal.Text
  | BlockQuote ![Block]
  | OrderedList !ListAttributes ![[Block]]
  | BulletList ![[Block]]
  | DefinitionList ![([Inline], [[Block]])]
  | Header {-# UNPACK #-}Int !Attr ![Inline]
  | HorizontalRule
  | Table !Attr
          !Caption
          ![ColSpec]
          !TableHead
          ![TableBody]
          !TableFoot
  | Div !Attr ![Block]
  | Null
  	-- Defined in ‘Text.Pandoc.Definition’
instance Eq Block -- Defined in ‘Text.Pandoc.Definition’
instance Ord Block -- Defined in ‘Text.Pandoc.Definition’
instance Show Block -- Defined in ‘Text.Pandoc.Definition’
instance Read Block -- Defined in ‘Text.Pandoc.Definition’
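
Given the types above, the lookup described earlier in this comment can be sketched roughly as follows. The types here are cut-down local stand-ins (blocks reduced to `Text`) so the snippet compiles on its own, and `isIndexPage` and `transform` are hypothetical names rather than anything from Hakyll or gwern.net. It assumes `index: true` arrives as a `MetaBool`.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as Map
import           Data.Text (Text)

-- Local stand-ins for the Pandoc types shown above.
data MetaValue = MetaBool Bool | MetaString Text
  deriving (Eq, Show)

newtype Meta = Meta { unMeta :: Map.Map Text MetaValue }

data Pandoc = Pandoc Meta [Text]

-- True only when the metadata contains `index` set to boolean true.
isIndexPage :: Pandoc -> Bool
isIndexPage (Pandoc meta _) =
  Map.lookup "index" (unMeta meta) == Just (MetaBool True)

-- Hypothetical pipeline step: skip the expensive pass on index pages.
transform :: (Pandoc -> Pandoc) -> Pandoc -> Pandoc
transform expensive doc
  | isIndexPage doc = doc
  | otherwise       = expensive doc
```

With an empty metadata map, `isIndexPage` is always `False`, so the expensive pass always runs, which matches the symptom described above.
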

@kimminss0

kimminss0 commented Nov 8, 2024

#643 (comment)

My preferred solution would be for Hakyll to simply not erase the original Pandoc metadata. Does anyone expect it to do that? You'd expect it to read a Pandoc as specified from the files specified, and it to not molest the Pandoc to erase the metadata or whatever is going on behind the scenes there.

Using getResourceString instead of getResourceBody seems to solve the problem.


getResourceBody :: Compiler (Item String)
Get the full contents of the matched source file as a string, but without metadata preamble, if there was one.

getResourceString :: Compiler (Item String)
Get the full contents of the matched source file as a string.

According to the documentation, we can retrieve the raw content including the yaml preamble part, using getResourceString.

Since pandocCompiler is identical to getResourceBody >>= renderPandoc, we can use getResourceString >>= renderPandoc to pass the metadata to Pandoc. Also, there is a series of renderPandocWith, renderPandocWithTransform, renderPandocWithTransformM, renderPandocItemWithTransformM in the same manner.

@gwern
Contributor

gwern commented Nov 8, 2024

That's not really a solution: that's an ad hoc workaround which forces all the work of parsing and munging and compiling onto the user (defeating the whole point of Hakyll, which is to not spend my time doing that piping - the sort of thing which, say, forces people to stop using Hakyll entirely or write their own libraries & close the issue...), to fix a design decision to silently destroy user data - a choice which still has not yet been given any justification at all, currently does not appear to even have been intentional (just an oversight), and which has many clear reasons to reject.

@jaspervdj
Owner

I also think that being able to reuse metadata from Pandoc would be good.

One of the reasons why this wasn't done when I initially wrote Hakyll is that:

  1. I don't think Pandoc nicely exposed the metadata API back then. This is no longer relevant.
  2. We want to be able to read the metadata sections of a large number of pages relatively fast (since you can use it in routing, etc.). We don't want to parse the whole markdown file just to retrieve the metadata.
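
To illustrate the second point, here is a rough sketch of reading only a leading metadata section without parsing the Markdown body. It is not Hakyll's actual implementation: `splitPreamble` is a hypothetical name, and it assumes the preamble is delimited by lines consisting of `---`.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Split a source file into an optional leading metadata preamble and the
-- remaining body. Only a plain line scan is needed; the body is never parsed.
splitPreamble :: String -> (Maybe String, String)
splitPreamble src =
  case lines src of
    (first : rest)
      | isDelim first
      , (meta, _close : body) <- break isDelim rest
      -> (Just (unlines meta), unlines body)
    _ -> (Nothing, src)
  where
    -- A delimiter is a line that is exactly "---", ignoring trailing whitespace.
    isDelim l = dropWhileEnd isSpace l == "---"
```

A file without an opening delimiter, or with an unterminated preamble, is returned unchanged with no metadata.
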

@kimminss0

kimminss0 commented Nov 8, 2024

If we modify the default implementation of pandocCompiler to use getResourceString and pass preambles to Pandoc, would that resolve the issue? I’m unsure about backward compatibility, so perhaps we should give users the option to choose. Am I missing anything?

@gwern
Contributor

gwern commented Nov 9, 2024

I am not sure. Apparently Pandoc metadata is not necessarily just a preamble at the start of the Markdown file, and you can have multiple YAML blocks anywhere in a file. (I guess this is to support templating / concatenating files and overriding defaults.) This came up recently in trying to add a safe lint warning about the YAML metadata: jgm/pandoc#10312

@gwern
Contributor

gwern commented Dec 3, 2024

Another example of how this bit me, and how deeply unexpected it is for Hakyll to go out of its way to erase the metadata on a Pandoc object:

A few days ago I noticed a misspelled field in one of my old essays while checking for another problem (modifed: instead of modified:); I made a mental note that I should add a check for metadata fields which are not on a whitelist, which could catch all typos like that. The easy simple obvious thing is to just add a function call to the Pandoc phases which first reads the metadata and does some basic sanity checks and errors out if the metadata is bad, like a modifed field which is not on the whitelist of allowed fields. This morning I took 15 minutes to write up & document a simple clean little function to take the Pandoc, extract the metadata from its wrappers and get the keys as a list, and check for some mandatory fields and that all entries were in a white list - straightforward and reliable and easy to extend, write, and read...

Only for it to crash on the first page the moment I tried to rebuild my site to winkle out all remaining typos.

Because the metadata was... empty. Huh?!?!?! How can it possibly be empty? That particular page is perfect, I check it all the time because it's the first one, of course it has Pandoc metadata, it's impossible for the metadata to not be there. ...Oh right. That bug.

I then looked at Context.hs and after a few minutes of looking at all the types and puzzling over the monoid and wondering how the heck I would implement this very simple, easy, desirable, (already-implemented - at least, if Hakyll wasn't screwing things up) lint with that, gave up. Oh well. That was a waste of a good hour.

I will just have to spot the remaining metadata problems the hard way, it seems.
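
For reference, the kind of whitelist check described above can be sketched like this. It is not the actual gwern.net function: `KeyReport` and `checkKeys` are hypothetical names, and keys are plain `String` for brevity (Pandoc's `unMeta` map is keyed by `Text`).

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Result of checking metadata keys against a whitelist.
data KeyReport = KeyReport
  { missingKeys :: [String]  -- mandatory fields that are absent
  , unknownKeys :: [String]  -- fields not on the whitelist (likely typos)
  } deriving (Eq, Show)

-- Compare the keys of a metadata map with the mandatory and allowed sets.
checkKeys :: Set.Set String -> Set.Set String -> Map.Map String v -> KeyReport
checkKeys mandatory allowed meta = KeyReport
  { missingKeys = Set.toList (mandatory `Set.difference` keys)
  , unknownKeys = Set.toList (keys `Set.difference` (allowed `Set.union` mandatory))
  }
  where
    keys = Map.keysSet meta
```

A page with `modifed` would be reported under `unknownKeys`, while an empty metadata map reports every mandatory field as missing, which is the failure described above.
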

gwern added a commit to gwern/gwern.net that referenced this issue Dec 4, 2024
@gwern
Contributor

gwern commented Jan 2, 2025

Another example: several hundred pages on Gwern.net have a thumbnail image associated with them (a graph, a chart, some AI-generated art etc). Those are shown in social media 'card' previews & in popups of that page, but not in the page itself. We'd like to show the image in the page itself somewhere (since I work pretty hard on some of those), and have currently settled on appending it to the end of the abstract. But we obviously do not want to manually edit in several hundred image links, which will be redundant with the page metadata (DRY), clutter the abstracts, cause various downstream problems, etc. So the logical way would be, when compiling, to simply extract it from the page metadata, walk the Pandoc, and inject a Figure element inside the div.abstract...

Except oh wait, right, you can't - not because the walking is infeasible or the logic is too squirrelly or any real problem, but because the metadata has been erased!

So now we just do that with Javascript. It damages layout reflow and is more runtime load on the client, but at least it was easy to write and doesn't involve anything crazy like 'rereading and parsing every Markdown file twice in order to get a single metadata value to pass into the Pandoc object in order to rewrite that (after querying it to make sure the rewrite hadn't happened before)'.

@kimminss0

kimminss0 commented Jan 2, 2025

#643 (comment)

According to the documentation, we can retrieve the raw content including the yaml preamble part, using getResourceString.

Since pandocCompiler is identical to getResourceBody >>= renderPandoc, we can use getResourceString >>= renderPandoc to pass the metadata to Pandoc. Also, there is a series of renderPandocWith, renderPandocWithTransform, renderPandocWithTransformM, renderPandocItemWithTransformM in the same manner.

If you need a workaround, this may suffice anyway. In this way, the metadata aren't removed and you can use them within renderPandocWithTransform, renderPandocWithTransformM, renderPandocItemWithTransformM (instead of pandocCompilerWithTransform, pandocCompilerWithTransformM, pandocItemCompilerWithTransformM).
