Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a marshalled output for index generation #1084

Merged
merged 6 commits into from
Jul 8, 2024

Conversation

panglesd
Copy link
Collaborator

This PR adds a new marshalled output for the generation of index files.

This is desirable for several reasons:

  • It allows to combine several pre-built indexes, making the index generation more incremental
  • It can be read by other OCaml tools using odoc as library, for instance when building a search engine index (cc @EmileTrotignon). Provide a Complete Module Index #1022 is also relevant.

The first commit fixes a bug in the computation of the ID of the page containing a standalone comment. The second commit is the actual change.

There are some design to discuss:

  • This PR adds a --marshall flag to the compile-index: by default the output is json, but it is marshalled with this flag.
  • The files containing marshelled indexes have to be prefixed by index- and have to have the .odoc extension.
  • Currently, the format is a hashtable. I think it would be better to have a better datastructure, similar to the one to contain occurrences.
  • The complexity when merging indexes many times is not ideal. For instance the following commands would be in O(n^2)
$ odoc compile-index --marshall -o index-all1.odoc a1.odocl
$ odoc compile-index --marshall -o index-all2.odoc a2.odocl index-a1.odoc
$ odoc compile-index --marshall -o index-all3.odoc a3.odocl index-a2.odoc
$ odoc compile-index --marshall -o index-all4.odoc a4.odocl index-a3.odoc
$ odoc compile-index --marshall -o index-all5.odoc a5.odocl index-a4.odoc
...
$ odoc compile-index --marshall -o index-all<n>.odoc a<n>.odocl index-a<n-1>.odoc
$

But I think that is ok.

@EmileTrotignon
Copy link
Collaborator

I do not like the index-<name>.odoc syntax. It is unusual to express file type with a prefix, and it might break scripts/makefiles that use *.odoc. .odoci or .odoc_index would be better.

@Drup
Copy link
Contributor

Drup commented Mar 18, 2024

I must admit I'm quite confused at the difference between *.odoc and *.odocl (are there others) ?

@EmileTrotignon
Copy link
Collaborator

@Drup
.odoc represent the data of a single page. .odocl also, but the page has been linked with others so that it is possible to resolve references.

In term of command line, you need a single module (cmti ?) to create a .odoc, but you need every .odoc to create a .odocl.
There are no others except the one in this proposal.

@panglesd
Copy link
Collaborator Author

I think the idea is that .odoc files are the result of the compile phase (the odoc compile command) while .odocl files are the result of the link phase (odoc link). Those phases and the dependencies inside them is not completely defined in odoc, but we are working on this.

In addition to the compile command (which create files corresponding to modules and standalone pages), several other commands generate .odoc files:

  • odoc compile-src produces a .odoc file (containing the required implementation info to render source code)
  • odoc count-occurrences produces a .odoc file (containing a marshalled table mapping ids to number of occurrences)
  • odoc source-tree produces a .odoc file (containing a source tree used to generate html navigation of directories when rendering source code)

Implementation and source trees are produced during the compile phase, and are meant to be linked. Contrary to this, occurrences and indexes are generated after the link phase, so I definitely agree that their extension should not be odoc nor odocl.

About the index-<name>.odoc syntax: it is not that unusal, at least in odoc:

  • pages must be named page-<name>.odoc
  • implementations must be named src-<name>.odoc
  • occurrence tables must be named occurrences-<name>.odoc
  • source trees must be named srctree-<name>.odoc

So, I think we have to options:

  1. Define a single extension for "post-link" files such as occurrences and indexes, and use prefix to distinguish them. For instance occurrences-<name>.odoca, index-<name>.odoca and toc-<name>.odoca (table of contents that will be introduced in the future might need a post-link file)
  2. Define one extension per "post-link" files: <name>.odoc-occurrences, <name>.odoc-index, <name>.odoc-toc.

I would prefer the first option.

@EmileTrotignon
Copy link
Collaborator

About the index-.odoc syntax: it is not that unusal, at least in odoc:

I know it is quite common in odoc, but at least for files names I really do not like it because it is a weird convention that fails to communicate its intent, so this requires knowledge of odoc to understand. It might also restrict names available for users, in a way that will be unexpected : no one will try to name a page .index (or at least they will understand that it might have meaning), but index- might happen.

I also do not like that file with the same extension are in fact of a different "type". I think you should be able to trust that the extension tells you what kind of file a file is, and in (1), all the .odoca have completely different (binary) content and they cannot be used interchangeably at all.

Regarding implementation and sources, it seems to me that they are used in the same context (linking) and as such can share an extension. Their content could even be of the same ocaml type if we wanted to (maybe they are ?).

I deeply favor (2), even though I understand that it requires changes out of the scope of this PR.

The only benefits for (1) I found in your message is the fact it is done like that in other places in odoc, but as it seems insufficient with regard to the drawbacks expressed above. Maybe there are other benefits I did not consider ?

@EmileTrotignon
Copy link
Collaborator

The interface changes from the first commit are fine, sherlodoc only uses Fold.unit and Fold.page

@panglesd
Copy link
Collaborator Author

I'm not decided on which is better, but let me still answer to some of you comments

this requires knowledge of odoc

Using odoc requires knowledge of odoc. Only odoc drivers are expected to call the command line. So I don't think this is such a big deal!

no one will try to name a page .index (or at least they will understand that it might have meaning), but index- might happen

This is not a problem at all. I don't think there can be any issue:

  • a compilation unit named <name> cannot have - in it
  • Other odoc produces artifact are prefixed: for instance, the compilation of index-bli.mld will generate page-index-bli.mld and that would not cause problems

I also do not like that file with the same extension are in fact of a different "type". I think you should be able to trust that the extension tells you what kind of file a file is, and in (1), all the .odoca have completely different (binary) content and they cannot be used interchangeably at all.

odoc files have completely different binary content, and cannot be used interchangeably: for instance, only pages can be passed as parents.
Similarly, json files can have completely different content and not be used interchangeably, but it still makes sense to give them the same extension.


So:

  1. index-<name>.odoca, occurrences-<name>.odoca:
    • Makes it explicit which phase those files belong to: .odoca files belong to the "post-link" phase, which takes .odocl files as input.
    • Consistent with what exists in the compile (.odoc) and link (.odocl) phase
  2. <name>.odoc-index, <name>.odoc-occurrences
    • @EmileTrotignon could you tell what what are the main benefits, considering what I wrote above?

@panglesd
Copy link
Collaborator Author

The interface changes from the first commit are fine, sherlodoc only uses Fold.unit and Fold.page

(sherlodoc uses entries_of_item, which is also changed by the bugfix.)

@EmileTrotignon
Copy link
Collaborator

(sherlodoc uses entries_of_item, which is also changed by the bugfix.)
I had missed that, the change still looks fine.

@EmileTrotignon
Copy link
Collaborator

EmileTrotignon commented Mar 18, 2024

Using odoc requires knowledge of odoc. Only odoc drivers are expected to call the command line. So I don't think this is such a big deal!

Yes, but it that case, we make up our own convention when a perfectly fine one already exists, so even though it's not unreasonable to ask people to learn things, my impression is that here, we are making it harder when its not necessary.

Other odoc produces artifact are prefixed: for instance, the compilation of index-bli.mld will generate page-index-bli.mld and that would not cause problems

You are right that it will not cause conflicts, but it could be confusing for people who might be trying to understand what is happening looking at _build, for instance if you also had bli.mld. Having a bli.odoc-index would be immediately understood for what it is.

odoc files have completely different binary content, and cannot be used interchangeably: for instance, only pages can be passed as parents.
Similarly, json files can have completely different content and not be used interchangeably, but it still makes sense to give them the same extension.

Yes, I do not mean that they should be used completely interchangeably, this is not possible if two file have different content, but that there exists a context where they are all accepted. This is the case for json (you can always pass a json file to jq), and I hope it is for .odoc (a .odoc is always valid as something to be linked against ? It can always be used to generate and odocl ?).
There is no such context for occurrences and index files, the only thing they have in common is being produced at link time, which is not the usual meaning of a file extension.

Regarding consistency, I think that this convention is more appropriate for src- and page- because they have shared uses, which is not the case for index- and occurrences-. (This could reformulated into : the meaning .odoc is not "the output of the compile phase" but "the input of the link phase", and there would be no such meaning for ".odoca")

To me, the main benefit of option (2) is that it makes it clear what you can do with the file, in a way that is understandable with minimal friction. This feel like a way more important information about a file than when it was produced. This is also more important because these files will used by tools external to odoc.

I think I have expressed myself fully, maybe a third person could take the final decision if you still disagree after reading this. .odoca is already a lot better than just .odoc.

in
Arg.(
value & opt_all convert_fpath []
& info ~doc ~docv:"FILE" [ "file-list" ])
in
let marshall =
let doc = "whether to output a json file, or an .odoc file" in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nitpick : I think it should be "a .odoc file" (pronounced "a dot odoc file")

@EmileTrotignon
Copy link
Collaborator

I have read the diff of the second commit and it looks good to me aside from a very small nitpick.

@EmileTrotignon
Copy link
Collaborator

This has been overhauled significantly, and a re-review would be appropriate.

The extension for index files is now .odoc-index, and you give .odocl files with odoc compile-index --include-rec my_dir .

Copy link
Collaborator Author

@panglesd panglesd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only reviewed the driver part!

It looks good to me. It would be nice to:

  • add some parallelism, which is easy here (see eg panglesd@2dadfb5)
  • Add a progress bar for indexes, since it can take quite a lot of time

ignore @@ Bos.OS.Dir.create Fpath.(html_dir / pkgname);
let format = `js in
let inputs = [ Fpath.(odoc_dir / pkgname / "index.odoc-index") ] in
let dst = Fpath.(html_dir / pkgname / "sherlodoc_db.js") in
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This filename is hardcoded and reused. It would be better to return it or something like that.

Comment on lines 257 to 283
let sherlodoc ~html_dir ~odoc_dir pkgs =
ignore @@ Bos.OS.Dir.create html_dir;
Sherlodoc.js Fpath.(html_dir / "sherlodoc.js");
Util.StringMap.iter (sherlodoc_index_one ~html_dir ~odoc_dir) pkgs;
let format = `marshal in
let dst = Fpath.(html_dir / "sherlodoc_db.marshal") in
let inputs =
pkgs |> Util.StringMap.bindings
|> List.map (fun (pkgname, _pkg) ->
Fpath.(odoc_dir / pkgname / "index.odoc-index"))
in
Sherlodoc.index ~format ~inputs ~dst ()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several hardcoded path there, which should be returned to avoid having them hardcoded in distinct places.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think it is a good idea to return them, because --search_uri wants relative paths. I have factored them in functions and values and its not repeated now.

@@ -1,31 +1,11 @@
open Bos
open Cmd_outputs
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not open that globally...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

src/odoc/bin/main.ml Outdated Show resolved Hide resolved
src/odoc/bin/main.ml Outdated Show resolved Hide resolved
src/odoc/bin/main.ml Show resolved Hide resolved
src/odoc/indexing.ml Outdated Show resolved Hide resolved
src/odoc/indexing.ml Outdated Show resolved Hide resolved
src/odoc/indexing.mli Show resolved Hide resolved
src/odoc/resolver.ml Outdated Show resolved Hide resolved
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests and prose around needs to be updated!

odoc-driver.opam Outdated
@@ -43,6 +43,7 @@ depends: [
"eio_main"
"progress"
"cmdliner"
"sherlodoc"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A pinned version is needed, for now, I think.

:: impl
in
Fiber.List.map link compiled |> List.concat

let odoc_index_path ~odoc_dir pkgname =
Fpath.(odoc_dir / pkgname / "index.odoc-index")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it would be cleaner to output index files outside of the directory made for .odocl files. That is:

Suggested change
Fpath.(odoc_dir / pkgname / "index.odoc-index")
Fpath.(index_dir / pkgname / "index.odoc-index")

or

Suggested change
Fpath.(odoc_dir / pkgname / "index.odoc-index")
Fpath.(index_dir / pkgname ^ ".odoc-index")

Although, this is not mandatory. Just resolve the comment if you believe it is better like that.

Copy link
Collaborator

@Julow Julow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice :) Made some comments but nothing that can't be changed later.

@@ -511,13 +527,28 @@ module Indexing = struct
value & opt_all convert_fpath []
& info ~doc ~docv:"FILE" [ "file-list" ])
in
let include_rec =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of this option ? The inputs term could also accept directories, which removes an exception case and is common among CLI tools.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, this option will change to -P and -L soon: we will use the index for sidebars too, and sidebars require those options.

Comment on lines 506 to 507
"At least one of --file-list or --include-rec or an .odocl file \
must be passed to odoc compile-index")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already checked in Indexing.compile but in a slightly different way: If --file-list points to an empty file, this passes but Indexing.compile has an other check that do not.

Why shouldn't making an empty index be allowed ? For example, drivers might want to make an index for every packages, even empty or bugous packages. This restriction increases complexity in drivers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

[] include_rec)
|> List.concat)
in
if files = [] then Error (`Msg "No .odocl files were included")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As there's several indirect ways to pass the list of inputs, to reassure driver authors, we could log the number of odocl files considered for indexing:

Indexing N odocl files.

('a, [> msg ]) result
(** This function is exposed for custom indexers that uses [odoc] as a library
to generate their search index *)

val compile :
[ `JSON | `Marshall ] ->
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use case for each formats ? It would be nice to write that down for maintenance in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use cases for json are for external use of indexes, which include using external search engine but is not really limited...
The usecase for the marshalled format is for odoc: currently search, soon table of contents, maybe indices of values/types/... so it is going to evolve.

I'm not sure how much it would help maintenance or hurt it to have the usecases written there, as they are likely going to evolve and the explanation will be outdated. Maybe a simple sentence (`Marshall is for uses of the index by odoc, `JSON is for external uses) would be sufficient?

Standalone documentation comments currently do not have an id. This id was
carried as the accumulator of the field, which yielded wrong results!

Signed-off-by: Paul-Elliot <[email protected]>
@panglesd panglesd force-pushed the search-index-buid-system-friendly branch 2 times, most recently from eb16010 to f0acee4 Compare July 8, 2024 08:06
@Julow
Copy link
Collaborator

Julow commented Jul 8, 2024

The CI is not passing due to the new dependency: https://ocaml.ci.dev/github/ocaml/odoc/commit/f0acee4b79947a0a4ff1643bc675e877045b98a3/variant/debian-12-5.0_opam-2.1

Otherwise, this is good to merge !

@panglesd panglesd force-pushed the search-index-buid-system-friendly branch from 23cff2f to 906c96d Compare July 8, 2024 14:36
@panglesd panglesd merged commit 934fe01 into ocaml:master Jul 8, 2024
13 checks passed
@panglesd
Copy link
Collaborator Author

panglesd commented Jul 8, 2024

I opened an issue on ocaml-ci : ocurrent/ocaml-ci#954

As soon as it is solved, we should reintroduce the driver's dependency on sherlodoc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants