[FEATURE] Support Mirador inline search #17

alxp · 2023-01-10T19:08:54Z

Overview of feature request

Mirador 3 includes an internal search function, but requires a query endpoint. Determine the best way to add this to Islandora.

The UT Scarborough islandora developers have implemented a fork of Islandora Mirador with annotation support which we should be able to pull from in the annotations branch of the repo here:

https://github.com/digitalutsc/islandora_mirador/tree/annotations

What kind of user is the feature intended for?
(Example user roles: Collections Manager, Developer, Systems Administrator, or User)

End user

What inspired the request?

Ongoing discussion of features needed for paged content.

What existing behavior do you want changed?

Extracted text action may be modified to make associating it with a given media easier.

Any brand new behavior do you want to add to Islandora?

An endpoint that serves search queries in a form compatible with Mirador's inline search.

Any related open or closed issues to this feature request?

kstapelfeldt · 2023-01-11T18:47:27Z

We are ready to report back some progress, as we have an implementation of in-text search and highlighting using Simple Annotation Server on one of our production servers. You can see an example of what we have working by clicking here: https://memory.digital.utsc.utoronto.ca/61220/utsc11543?q=student - the active result is yellow in the viewer, and changes as you click through the list. Logged in users who are administrators can edit the text and save it back to the simple annotation store.

The connection with the Drupal node is preserved through a custom field that matches the annotation ID and the node ID. We have a Java converter that transforms Google Vision’s output JSON into the format required for IIIF search, and we assume that this pattern could be followed for things like HOCR. I’m attaching our rough diagram of the workflow in case it’s of interest. We’re happy to answer questions, and Kyle has updated the demo implementation:

https://github.com/digitalutsc/islandora_lite_docs/wiki/Mirador-Search-and-Annotations-(Prototype
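
As a rough illustration of the conversion step described above (this is not the UTSC Java converter, which is not shown in this thread), the sketch below turns the word-level `textAnnotations` of a Google Vision response into IIIF Search-style annotations. The function name, the canvas URI, and the annotation `@id` pattern are hypothetical.

```python
from typing import Any


def vision_words_to_annotations(vision: dict[str, Any], canvas_uri: str) -> list[dict[str, Any]]:
    """Turn word-level Vision textAnnotations into IIIF Search-style annotations."""
    annotations = []
    # Entry 0 of textAnnotations holds the full text; the remaining entries are words.
    for i, word in enumerate(vision.get("textAnnotations", [])[1:]):
        vertices = word["boundingPoly"]["vertices"]
        # Vision's JSON omits a coordinate when its value is 0.
        xs = [v.get("x", 0) for v in vertices]
        ys = [v.get("y", 0) for v in vertices]
        x, y = min(xs), min(ys)
        w, h = max(xs) - x, max(ys) - y
        annotations.append({
            "@id": f"{canvas_uri}/annotation/{i}",  # hypothetical id pattern
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "resource": {"@type": "cnt:ContentAsText", "chars": word["description"]},
            "on": f"{canvas_uri}#xywh={x},{y},{w},{h}",
        })
    return annotations
```

Each annotation targets a rectangular fragment of the canvas, which is what lets a viewer draw the highlight box over the page image.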

kstapelfeldt · 2023-01-11T18:49:13Z

@kylehuynh205 @Natkeeran

alxp · 2023-01-25T14:34:43Z

@wgilling @patdunlavey I mentioned in the committers call last week that there's a more generic plugin for highlighting hOCR in Solr that might be better than a secondary endpoint search in Mirador.

https://dbmdz.github.io/solr-ocrhighlighting/0.8.3/

Looks like it would also require a second Solr index alongside Drupal but it would be a more unified Islandora experience and the demo does look slick. And it handles things like phrases across column / page boundaries which is really cool.

patdunlavey · 2023-01-25T18:47:34Z

@alxp funny, I just got to that solr-ocrhighlighting project by other means.

I've been reviewing the SimpleAnnotationServer approach, in particular utsc's wiki page, and from what I can see the SAS is simply a cache of pre-generated annotations which can be searched to return a list of matching annotations. The process for generating the annotations and loading them into that cache is extremely specific to utsc's workflow, and involves a lot of manual steps per OCR'd page. The primary attraction seems to be that very little custom code was needed to get it working. I'm sure the manual steps could be largely mechanized, but now you're writing code - and we have to solve for a much more general case.

I've been looking at how to get the extracted hOCR indexed by search_api_solr. The search_api_attachments module should help with this, but it doesn't follow reverse entity references. This issue suggests a possible workaround, though it seems awfully hinky. Would we need to custom code something just to make the contents of the extracted OCR available to index in solr?

I'm not sure how we would instruct solr to index the hOCR currently being generated from islandora_text_extraction. Presumably that's where the solr-ocrhighlighting idea could come in. It looks like Archipelago uses this library in defining an ocr solr field type. It's not clear to me where/how the library comes to be instantiated on the solr image, since it's just using a standard solr docker image.

What indicates to you that it would need to utilize a second solr index?

DiegoPino · 2023-02-02T15:08:04Z

@patdunlavey @alxp to put some context into @patdunlavey statements here.

Archipelago has been using the plugin that Johannes from the Bavarian state library and his team developed for almost 3 years already. We worked with their team and our own folks to architect this deeply into our system, testing, expanding this idea into a more complex and generic way of producing the many-to-one needs coming from different sources of OCR/HOCR and have already repositories in our Archipelago community that have hit over 700K documents with real time capabilities.

But to allow this to really work the Solr side of things @patdunlavey is pointing to is just one of the factors. Drupal is terrible handling that number of entities. So we created a whole ecosystem around an entity-less custom Search API Data Source https://github.com/esmero/strawberryfield/blob/1.1.0/src/Plugin/search_api/datasource/StrawberryfieldFlavorDatasource.php

that generates a different type of Solr Documents (same index, other indexes, multiple indexes, etc) just for this case connected/drupal data wise via a native data type that simulates what an entity would do:

https://github.com/esmero/strawberryfield/blob/1.1.0/src/TypedData/StrawberryfieldFlavorDataDefinition.php

And this is just the start. We tap into the Search API queries (modifying them at the solarium level) to allow highlights to work by disabling the native one too (both can not co-exist) and have a TON of event subscribers that track these type of document needs of update/removal plus a hierarchical backend processing plugin system to extract OCR https://github.com/esmero/strawberry_runners/tree/0.5.0 amongst many other type of data (e.g WACZ full text and URLs, XML, Simple text) that go into Strawberry Flavors (that is how we name this special thing) and if you dig deeper you will see much more integration like aggregated fields that harvest from Solr, etc is present.

Full text search is driven by custom Controllers, and in our recent code we do front-end/back-end matching of our Dynamic IIIF Manifests too, to allow IIIF Search API capabilities. Annotations are handled separately (the plugin that is mentioned deals with MiniOCR or ALTO only) and are already embedded in each ADO (Archipelago Digital Object) JSON; we have also done joint work with the Annotorious team to enable that.

In other words, this is a totally different architecture and implies also tons of code to make it work. If you decide to go this way, and decide to use code from our system I would appreciate you keep attribution (what we have is not test code and examples, its production code many institutions are using), researching and developing this into a production ready system in our community was a big communal effort. I would also encourage you to test Archipelago in that sense to have an idea of what is implied. Thanks a lot

patdunlavey · 2023-02-17T16:45:02Z

I spoke with @ajstanley who advocated for the importance of ocr'd data being human-editable. If we accept that premise, then it may argue in favor of storing the hOCR data in a drupal field on the media. Then a field widget could, in theory, be designed for people to edit the hOCR data. I can imagine dropping in a tool like this.

patdunlavey · 2023-03-24T21:03:02Z

Some updates to what I've been doing on this.

I added a field on the "Original File" media for storing the raw hocr text ("field_editable_hocr_text"), and added this field to the generate_hocr_extracted_text action so that the hocr output goes both to the file field and this text field:

(Note that this results in hocr text with embedded "<br />" tags, which is something I need to fix. For now, I just fix the malformed xml manually.)
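
Until the field stops adding those tags, the manual cleanup could be scripted. The following is only a sketch, assuming the unwanted tags appear as literal `<br>` or `<br />` strings and that the hOCR has no line-break elements of its own. The helper names are hypothetical, and the well-formedness check is a rough one based on Python's standard XML parser.

```python
import re
import xml.etree.ElementTree as ET

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def strip_br_tags(hocr: str) -> str:
    """Remove literal <br> tags that the text field added to the hOCR."""
    return BR_TAG.sub("", hocr)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

Running `is_well_formed(strip_br_tags(value))` before saving or indexing would flag any remaining problems without silently accepting broken markup.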

I have solr ocrHighlighting working on my local environment. This took some work, and I won't go into every detail at the moment. The main points:

In our Makefile, we get the ocrhighlighting library like this:

	# Remove the existing ISLANDORA core and any previously installed plugin jar, then recreate the core.
	docker-compose exec -T solr with-contenv bash -lc "rm -rf /opt/solr/server/solr/ISLANDORA /opt/solr/server/solr/contrib/ocrhighlighting/lib/solr-ocrhighlighting.jar"
	docker-compose exec -T drupal with-contenv bash -lc "for_all_sites create_solr_core_with_default_config"
	# Download the solr-ocrhighlighting 0.7.2 release jar to the host.
	curl -k -L https://github.com/dbmdz/solr-ocrhighlighting/releases/download/0.7.2/solr-ocrhighlighting-0.7.2.jar > data/solr-ocrhighlighting.jar
	# Create the plugin directory in the solr container and copy the jar into it.
	docker-compose exec -T solr with-contenv bash -lc "mkdir -p /opt/solr/server/solr/contrib/ocrhighlighting/lib"
	docker cp data/solr-ocrhighlighting.jar $$(docker-compose ps -q solr):/opt/solr/server/solr/contrib/ocrhighlighting/lib/solr-ocrhighlighting.jar
	# Give the solr user ownership of the plugin directory.
	docker-compose exec -T solr with-contenv bash -lc "chown -R solr:solr /opt/solr/server/solr/contrib/ocrhighlighting"

We created modified versions of the solrconfig.xml and schema.xml files, which we load into solr using a similar technique in our Makefile. I'm attaching those.
solrconfig-schema.zip.
There was some troubleshooting with solr versions too that I've lost track of (it's in our docker-compose.yml).
I added a search api solr field definition for the hocr field:
search_api_solr.solr_field_type.text_ocr_und_7_0_0.zip
I defined a field in the search api index that indexes this:

  field_editable_hocr_text:
    label: 'HOCR Text'
    datasource_id: 'entity:node'
    property_path: 'search_api_reverse_entity_references_media__field_media_of:field_editable_hocr_text'
    type: 'solr_text_custom:ocr_highlight'

Phew!
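
For testing, a query against the core needs to switch on the plugin's highlighting. The sketch below builds such a request URL, assuming the highlighting search component is already configured in solrconfig and that the `hl` and `hl.ocr.fl` parameters of solr-ocrhighlighting are the only ones needed. The function name and the URL in the usage note are hypothetical, and the search term is not escaped for Solr query syntax.

```python
from urllib.parse import urlencode


def ocr_highlight_query(solr_select_url: str, term: str, ocr_field: str) -> str:
    """Build a Solr select URL that asks for OCR highlighting on one field."""
    params = {
        "q": f"{ocr_field}:{term}",  # term is used as-is, without Solr escaping
        "hl": "true",                # turn highlighting on
        "hl.ocr.fl": ocr_field,      # field(s) the OCR highlighter should process
        "wt": "json",
    }
    return f"{solr_select_url}?{urlencode(params)}"
```

For example, `ocr_highlight_query("http://localhost:8983/solr/ISLANDORA/select", "Bix", "tcocr_highlightm_X3b_en_field_editable_hocr_text")` yields a URL that can be fetched with curl or a browser.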

With this all in place and working, I can see this in my test solr query result:

  "ocrHighlighting":{
    "4o0hnj-default_solr_index-entity:node/99:en":{
      "tcocr_highlightm_X3b_en_field_editable_hocr_text":{
        "snippets":[{
            "text":"bands, gave <em>Bix</em> his first big job,",
            "score":62655.01,
            "pages":[{
                "id":"page_1",
                "width":2352,
                "height":2810}],
            "regions":[{
                "ulx":1702,
                "uly":1729,
                "lrx":2035,
                "lry":1754,
                "text":"bands, gave <em>Bix</em> his first big job,",
                "pageIdx":0}],
            "highlights":[[{
                  "ulx":128,
                  "uly":1,
                  "lrx":158,
                  "lry":18,
                  "text":"Bix",
                  "parentRegionIdx":0}]]}],
        "numTotal":10}}},

My next step is to write a controller to perform a search and return a list of IIIF annotations! Then, in theory, we should be able to plug that link into our IIIF manifest.
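
The core of that controller is a coordinate conversion, sketched here in Python rather than as Drupal code. It assumes the highlight boxes are relative to their parent region, which is consistent with the sample above, where the highlight's `ulx` of 128 is smaller than the region's `ulx` of 1702. The function names, the canvas URI, and the annotation `@id` pattern are hypothetical.

```python
def highlight_to_xywh(region: dict, highlight: dict) -> str:
    """Convert a region-relative highlight box to a page-level x,y,w,h string."""
    x = region["ulx"] + highlight["ulx"]
    y = region["uly"] + highlight["uly"]
    w = highlight["lrx"] - highlight["ulx"]
    h = highlight["lry"] - highlight["uly"]
    return f"{x},{y},{w},{h}"


def snippet_to_annotations(snippet: dict, canvas_uri: str) -> list[dict]:
    """Build one IIIF Search-style annotation per highlight box in a snippet."""
    annotations = []
    # "highlights" is a list of groups; each group is a list of boxes.
    for group in snippet["highlights"]:
        for box in group:
            region = snippet["regions"][box["parentRegionIdx"]]
            xywh = highlight_to_xywh(region, box)
            annotations.append({
                "@id": f"{canvas_uri}/search/{xywh}",  # hypothetical id pattern
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "resource": {"@type": "cnt:ContentAsText", "chars": box["text"]},
                "on": f"{canvas_uri}#xywh={xywh}",
            })
    return annotations
```

With the sample values above, the "Bix" highlight maps to `xywh=1830,1730,30,17` on the page. If the plugin were configured to return absolute highlight coordinates, the region offset would have to be dropped.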

patdunlavey · 2023-03-27T17:26:50Z

@ajstanley might it make sense to create a custom hocr field type/widget/formatter? For starters, it could solve the embedded "<br />" problem. My thinking is that initially we would just provide a plain text widget, and then later add in an hOCR editor like this one: https://github.com/GeReV/hocr-editor-ts. Is this a project I could entice you to take on?

Would this be part of the islandora_mirador module? In any case, at the very least, we would need a PR against islandora_text_extraction to enable using our new field type here.

ajstanley · 2023-03-28T13:17:37Z

@patdunlavey I can absolutely take that widget on.
I've got https://github.com/GeReV/hocr-editor-ts working in a demo environment (it needed a LOT of updating to compile) but is not really useful in its current incarnation.

patdunlavey · 2023-03-28T20:48:37Z

@ajstanley Great to hear that you got the hocr editor working, if only after a fashion. Do you think it can be made useful, or do you think you need to look elsewhere?

ajstanley · 2023-03-29T13:43:16Z

@patdunlavey I think it's a non-starter. You can see my working version here, but it's going to need a whole lot of deep-tissue massaging to be useful.
This app builds the hOCR, but we're doing that already. The editing seems really clunky.

patdunlavey · 2023-03-29T15:06:31Z

Do you see any path forward on this? I'd say that minimum viable (initial) product is a field type that can hold the output of hOCR. We could just modify islandora_text_extraction to permit writing to a plain text field, but I'm thinking that having a special hOCR field type would make other aspects of the overall project easier.

ajstanley · 2023-03-29T15:26:48Z

The text field that's there already allows for correcting text, we could make the hOCR human-editable as well, but that would be an onerous undertaking for the unfortunate grad student who was saddled with it.

If we can start with having the hOCR viewable, but pull it from a field on the media rather than from a saved file, we can add editing functionality later.

Baby steps...

patdunlavey · 2023-03-29T15:44:13Z

As I described previously, I added a field to the media type to store the hOCR. This is in addition to the file field that @alxp 's hocr text overlay work uses. My reason for adding the long text field is 1. because that's what search api can index using a reverse entity reference from the node to the field on the media, and 2. "text_long" because that's what islandora_text_extraction dictates here (thus my suggestion that we make a PR to change that). The "text_long" field would be fine with me if it didn't insist on inserting <br /> for new lines in the saved text.

Having a field type that islandora_mirador defines, my thinking goes, would permit us to not have to design hacky logic to determine what media field contains our hocr, though we could also avoid that by just making the source field name configurable in the islandora_mirador config page, which may be best in any case. So maybe I'm getting talked out of the need for a special field type, at least as long as the wysiwyg hocr editing tool is out of scope.

patdunlavey · 2023-04-05T14:13:04Z

I have a not-yet-fully-tested version of a IIIF Search API endpoint working. It generates AnnotationLists when given a node id of the page or paged content node and a search term.
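
For readers unfamiliar with the response format, here is a minimal sketch of the AnnotationList envelope defined by the IIIF Content Search API 1.0. It is not the code from the fork. The function name is hypothetical, and the annotations are assumed to have been built already.

```python
import json


def build_annotation_list(search_uri: str, annotations: list[dict]) -> dict:
    """Wrap annotations in a IIIF Content Search 1.0 AnnotationList."""
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": search_uri,  # the URI of this search request, query string included
        "@type": "sc:AnnotationList",
        "resources": annotations,
    }


def to_json(annotation_list: dict) -> str:
    """Serialize the list for the HTTP response body."""
    return json.dumps(annotation_list, indent=2)
```

A search with no matches would still return the envelope, with an empty `resources` array.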

Here's my fork of islandora_mirador. There's a lot of setup involved, which I tried to fully document in the README.

The primary missing piece that I'm aware of is the part of \Drupal\islandora_iiif\Plugin\views\style\IIIFManifest::render that needs to provide the search block in which our search endpoint will be instantiated.
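
What the manifest needs is a `service` entry that points at the search endpoint. The sketch below shows that block as described in the IIIF Content Search API 1.0, written in Python purely for illustration. The function name and the endpoint URI are hypothetical, and the real change would live in the PHP render method named above.

```python
def add_search_service(manifest: dict, search_uri: str) -> dict:
    """Return a copy of a IIIF Presentation 2 manifest with a search service added."""
    service = {
        "@context": "http://iiif.io/api/search/1/context.json",
        "@id": search_uri,
        "profile": "http://iiif.io/api/search/1/search",
    }
    updated = dict(manifest)  # shallow copy; the input manifest is left untouched
    existing = updated.get("service", [])
    if isinstance(existing, dict):
        # A manifest may carry a single service object instead of a list.
        existing = [existing]
    updated["service"] = [*existing, service]
    return updated
```

A viewer that supports content search can then discover the endpoint from the manifest alone, without viewer-specific configuration.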

@alxp @ajstanley @dmer @Islandora/committers

adam-vessey · 2023-04-05T16:48:28Z

If we're implementing a IIIF Content Search API endpoint, does it really belong as a part of Mirador (or rather, islandora_mirador), specifically? Seems more like a IIIF thing, no? Like, might belong more so in the islandora_iiif module proper? Or some other associated module?

Looking at the comparison, there's a few things to highlight:

- Implementing deprecated hooks (the hook_search_api_solr_query_alter() bit; should be done event-wise).
- Injecting but not using services: OcrSearchController has the entity_type.manager referenced/passed in its ::create() and assigned to an (undeclared?) entityTypeManager property in ::construct(); however, there are still multiple references to \Drupal::entityTypeManager() in the various controller methods.
- Manual/explicit string manipulation for URL creation (2.x...patdunlavey:islandora_mirador:ISSUE-17-inline-search#diff-760a5f5b54a49de09faca5236c8ee9e0a837da99811cb4f74f54938f2d5392eeR301-R303), instead of using Drupal's URL building facilities.
- It looks like there are events (/hooks) to allow for the definition of more extensive Solr configs (instead of including Solr and Drupal configurations in a subdirectory with instruction to use them), such as:
  - PostConfigFilesGenerationEvent (/hook_search_api_solr_config_files_alter()), to allow bits of schema to be included
  - Plugin definitions to define data types (and more?): https://www.drupal.org/docs/8/modules/search-api/developer-documentation/available-plugin-types#data-types

seth-shaw-asu · 2023-04-05T17:29:20Z

If we're implementing a IIIF Content Search API endpoint, does it really belong as a part of Mirador (or rather, islandora_mirador), specifically? Seems more like a IIIF thing, no? Like, might belong more so in the islandora_iiif module proper? Or some other associated module?

I second the idea of pushing these changes into islandora_iiif. The only bit that looks mirador specific is the mirador config form.

mjordan · 2023-04-05T18:05:36Z

I agree. I think the IIIF Content Search API has uses outside of Mirador and should be implemented separately.

DiegoPino · 2023-04-06T13:23:27Z

Hey @patdunlavey and the @Islandora/committers here. I'm raising a red flag 🟥

OSS does not mean copy and paste without attributions. You all know this. The shared work here is heavily "based" (from the devops to the implementation) on our own research and tested code and production implementations (years old already). not to mention that Pat, you are part of the Archipelago community and you had the chance to test it, used it in production and even get 1:1 with us about how it works. Not even variable name changes, even many of my inline comments.

e.g patdunlavey@9d9269c is more than heavily based on https://github.com/esmero/strawberryfield/blob/573ffa44a369ad68c59a92b2746258c2671ef13f/src/Controller/StrawberryfieldFlavorDatasourceSearchController.php#L189

And this 2.x...patdunlavey:islandora_mirador:ISSUE-17-inline-search#diff-74bee3a8afb13b6345660264b05398e816ce69bff9b8d3d26a45682f92bb8c44R91-R110 (except for the "mysterious why" comment) is 1:1 to
https://github.com/esmero/strawberryfield/blob/573ffa44a369ad68c59a92b2746258c2671ef13f/strawberryfield.module#L333-L374

But on your side of things, you should also check on these things, the fact that there are comments in https://github.com/esmero/strawberryfield/blob/1.1.0/strawberryfield.module#L333-L374 replacing mine about this not being understood

 It's a mystery as to why it should be necessary to alter the solarium query in order to add the
 * highlight parameters. We should be able to add them inside the search_api query build using the
 * `solr_param_` method to inject solarium parameters: https://git.drupalcode.org/project/search_api_solr/-/blob/4.x/src/Plugin/search_api/backend/SearchApiSolrBackend.php#L1605-1610
 * However that is not working, and so we're reduced to this ugly alter hook.

Means you (the we in that comment) are not copying the why, and the lack of research (and why I wrote it like that) implies you might be even copying bugs.

And I can keep going.

What really bothers me here is the idea of "I did the research and found out" instead of being clear of where this comes from. If this was a contribution coming from an individual not representing an institution I would be less concerned and even let a few of these pass, but this is not.

Attribution in GPL (we are V3) is not optional and this ethically also affects basically every person involved on our side, @giancarlobi worked on this, @alliomeria worked on this. Community members doing heavy work on testing, reading docs, re-testing, indexing, implementing, refining. It is a community issue of obscuring efforts. Not cool, people.

I want to hear your reactions please.

giancarlobi · 2023-04-06T13:53:05Z

Thanks @DiegoPino for this post that I fully agree mainly for the defense of what is a really open-source community. Dear Islandora friends, this is the reason why I abandoned you (I know, not a big loss for you) because your current idea of community is far away from the one I met in Arcidosso in 2013. Good luck.

patdunlavey · 2023-04-06T15:47:52Z

Morning everybody. I'm just catching up on this. Well I do seem to have stepped in it, big time! In my meager defense, the code I've been working on - which to be clear, I gratefully acknowledge that I used Archipelago code and its underlying research as my starting point - is very much a WIP and at this point, barely a proof of concept for part of the functionality that this issue proposes. I'm nowhere near proposing a PR (the code in question is currently on a task branch of my fork of this islandora_mirador module). It is and has been my intention, if this does result in my creating a PR, to run it by @DiegoPino first to get what I hope will be his blessing and, in any case, ensure that he and others are properly credited. I sincerely apologize for the bad feeling that my premature sharing of this code has generated and I promise to be more mindful going forward.

patdunlavey · 2023-04-06T16:14:07Z

Thank you @adam-vessey @seth-shaw-asu and @mjordan for your comments. Skipping for the moment your very helpful code-review comments (though I want to clarify that code review is premature at this point), it seems clear that there is a consensus that it makes most sense to re-target this issue and solution (assuming we find one we are all happy with) to the islandora_iiif module. Is that an accurate reading? @alxp do you agree with that?

alxp · 2023-04-07T17:28:42Z

Hi @patdunlavey , I definitely agree that anything primarily relating to IIIF and not specific to Mirador should be in islandora_iiif.

That module currently lives in the main Islandora module; it might be helpful to pull it out, as we did with islandora_mirador, but that is not necessary.

alxp added the enhancement New feature or request label Jan 10, 2023

alxp mentioned this issue Jan 10, 2023

Update mirador #14

Merged

dmer mentioned this issue Mar 31, 2023

Ingesting HOCR derivatives as a media attachment mjordan/islandora_workbench#592

Open
