[FEATURE] Support Mirador inline search #17
Comments
We are ready to report back some progress, as we have an implementation of in-text search and highlighting using Simple Annotation Server on one of our production servers. You can see an example of what we have working by clicking here: https://memory.digital.utsc.utoronto.ca/61220/utsc11543?q=student - the active result is yellow in the viewer, and changes as you click through the list. Logged in users who are administrators can edit the text and save it back to the simple annotation store. The connection with Drupal node is preserved through a custom field that matches the annotation ID and the node ID. We have a Java converter that transforms Google Vision’s output JSON into the format required for IIIF search, and we assume that this pattern could be followed for things like HOCR. I’m attaching our rough diagram of the workflow in case it’s of interest. We’re happy to answer questions, and Kyle has updated the demo implementation https://github.com/digitalutsc/islandora_lite_docs/wiki/Mirador-Search-and-Annotations-(Prototype |
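As a rough illustration of the coordinate step in such a conversion (this is a Python sketch, not the Java converter described above), the function below turns a list of corner points into the `xywh` fragment used to target a region of a IIIF canvas. It assumes each corner is a dict with optional integer `x` and `y` keys, which is how Google Vision bounding boxes are commonly serialized; the function name is hypothetical.

```python
from typing import Dict, List


def vertices_to_xywh(vertices: List[Dict[str, int]]) -> str:
    """Turn corner points into an 'xywh=x,y,w,h' media fragment."""
    # Assumption: a missing "x" or "y" key means 0.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    x, y = min(xs), min(ys)
    # Width and height are the extent of the axis-aligned bounding rectangle.
    return f"xywh={x},{y},{max(xs) - x},{max(ys) - y}"


corners = [{"x": 10, "y": 20}, {"x": 110, "y": 20}, {"x": 110, "y": 60}, {"x": 10, "y": 60}]
fragment = vertices_to_xywh(corners)
```

Rotated or skewed boxes are flattened to their enclosing rectangle here, which may or may not be acceptable for highlighting.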
@wgilling @patdunlavey I mentioned in the committers call last week that there's a more generic plugin for highlighting hOCR in Solr that might be better than a secondary endpoint search in Mirador. https://dbmdz.github.io/solr-ocrhighlighting/0.8.3/ Looks like it would also require a second Solr index alongside Drupal but it would be a more unified Islandora experience and the demo does look slick. And it handles things like phrases across column / page boundaries which is really cool. |
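For orientation, a query against a Solr core with that plugin installed might be assembled roughly as follows. This is a minimal sketch: the core URL and the `ocr_text` field name are placeholders, and the `hl.ocr.fl` parameter is used as I understand it from the plugin documentation, so check it against the version you deploy.

```python
from urllib.parse import urlencode


def build_ocr_highlight_query(solr_core_url: str, term: str, ocr_field: str = "ocr_text") -> str:
    """Build a select URL that asks Solr for OCR highlighting on one field."""
    params = {
        # Assumption: `term` is a single word needing no Solr query escaping.
        "q": f"{ocr_field}:{term}",
        "hl": "true",
        # Field(s) the OCR highlighter should run on (plugin-specific parameter).
        "hl.ocr.fl": ocr_field,
        "wt": "json",
    }
    return f"{solr_core_url}/select?{urlencode(params)}"


url = build_ocr_highlight_query("http://localhost:8983/solr/ocr", "student")
```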
@alxp funny, I just got to that solr-ocrhighlighting project by other means. I've been reviewing the SimpleAnnotationServer approach, in particular utsc's wiki page, and from what I can see the SAS is simply a cache of pre-generated annotations which can be searched to return a list of matching annotations. The process for generating the annotations and loading them into that cache is extremely specific to utsc's workflow, and involves a lot of manual steps per OCR'd page. The primary attraction seems to be that very little custom code was needed to get it working. I'm sure the manual steps could be largely mechanized, but now you're writing code - and we have to solve for a much more general case. I've been looking at how to get the extracted hOCR indexed by search_api_solr. The search_api_attachments module should help with this, but it doesn't follow reverse entity references. This issue suggests a possible workaround, though it seems awfully hinky. Would we need to write custom code just to make the contents of the extracted OCR available to index in solr? I'm not sure how we would instruct solr to index the hOCR currently being generated from islandora_text_extraction. Presumably that's where the solr-ocrhighlighting idea could come in. It looks like Archipelago uses this library in defining an ocr solr field type. It's not clear to me where/how the library comes to be instantiated on the solr image, since it's just using a standard solr docker image. What indicates to you that it would need to utilize a second solr index? |
@patdunlavey @alxp to add some context to @patdunlavey's statements here. Archipelago has been using the plugin that Johannes from the Bavarian State Library and his team developed for almost 3 years already. We worked with their team and our own folks to architect this deeply into our system, testing and expanding the idea into a more complex and generic way of handling the many-to-one needs that come from different sources of OCR/HOCR, and repositories in our Archipelago community have already passed 700K documents with real-time capabilities. But for this to really work, the Solr side of things @patdunlavey is pointing to is just one of the factors. Drupal is terrible at handling that number of entities. So we created a whole ecosystem around an entity-less custom Search API Data Source https://github.com/esmero/strawberryfield/blob/1.1.0/src/Plugin/search_api/datasource/StrawberryfieldFlavorDatasource.php that generates a different type of Solr Document (same index, other indexes, multiple indexes, etc.) just for this case, connected to Drupal data via a native data type that simulates what an entity would do: And this is just the start. We tap into the Search API queries (modifying them at the Solarium level) to allow highlights to work, which also means disabling the native highlighter (the two cannot coexist), and we have a TON of event subscribers that track when these types of documents need to be updated or removed, plus a hierarchical backend processing plugin system to extract OCR https://github.com/esmero/strawberry_runners/tree/0.5.0 among many other types of data (e.g. WACZ full text and URLs, XML, simple text) that go into Strawberry Flavors (that is how we name this special thing). If you dig deeper you will see much more integration, such as aggregated fields that harvest from Solr. Full text search is driven by custom Controllers, and in our recent code we do front-end/back-end matching of our Dynamic IIIF Manifests too, to allow IIIF Search API capabilities. 
Annotations are handled separately (the plugin that is mentioned deals with MiniOCR or ALTO only) and are already embedded in each ADO (Archipelago Digital Object) JSON; we have also done joint work with the Annotorious team to enable that. In other words, this is a |
I spoke with @ajstanley who advocated for the importance of ocr'd data being human-editable. If we accept that premise, then it may argue in favor of storing the hOCR data in a drupal field on the media. Then a field widget could, in theory, be designed for people to edit the hOCR data. I can imagine dropping in a tool like this. |
Some updates to what I've been doing on this. I added a field on the "Original File" media for storing the raw hocr text ("field_editable_hocr_text"), and added this field to the generate_hocr_extracted_text action so that the hocr output goes both to the file field and this text field: I have solr ocrHighlighting working on my local environment. This took some work, which I won't describe in full detail at the moment. The main points:
Phew! With this all in place and working, I can see this in my test solr query result:
My next step is to write a controller to perform a search and return a list of IIIF annotations! Then, in theory, we should be able to plug that link into our IIIF manifest. |
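To make the "plug that link into our IIIF manifest" step concrete: in a IIIF Presentation 2 manifest, a Content Search 1.0 endpoint is advertised through a `service` entry. The sketch below shows the shape of that entry using a plain Python dict; the function name and URLs are hypothetical, and this is not the islandora_iiif implementation.

```python
def add_search_service(manifest: dict, search_url: str) -> dict:
    """Return a copy of the manifest that advertises a Content Search 1.0 service."""
    service = {
        "@context": "http://iiif.io/api/search/1/context.json",
        "@id": search_url,
        "profile": "http://iiif.io/api/search/1/search",
    }
    updated = dict(manifest)
    existing = updated.get("service")
    if existing is None:
        updated["service"] = service
    elif isinstance(existing, list):
        # Keep any services already declared on the manifest.
        updated["service"] = existing + [service]
    else:
        updated["service"] = [existing, service]
    return updated


manifest = {"@id": "https://example.org/node/1/manifest", "@type": "sc:Manifest"}
with_search = add_search_service(manifest, "https://example.org/node/1/search")
```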
@ajstanley might it make sense to create a custom hocr field type/widget/formatter? For starters, it could solve the embedded " Would this be part of the islandora_mirador module? In any case, at the very least, we would need a PR against islandora_text_extraction to enable using our new field type here. |
@patdunlavey I can absolutely take that widget on. |
@ajstanley Great to hear that you got the hocr editor working, if only after a fashion. Do you think it can be made useful, or do you think you need to look elsewhere? |
@patdunlavey I think it's a non-starter. You can see my working version here, but it's going to need a whole lot of deep-tissue massaging to be useful. |
Do you see any path forward on this? I'd say that minimum viable (initial) product is a field type that can hold the output of hOCR. We could just modify islandora_text_extraction to permit writing to a plain text field, but I'm thinking that having a special hOCR field type would make other aspects of the overall project easier. |
The text field that's there already allows for correcting text. We could make the hOCR human-editable as well, but that would be an onerous undertaking for the unfortunate grad student who was saddled with it. If we can start with having the hOCR viewable, but pull it from a field on the media rather than from a saved file, we can add editing functionality later. Baby steps... |
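For readers unfamiliar with what the stored hOCR carries: each word is typically a `span` with class `ocrx_word` whose `title` attribute holds a `bbox x0 y0 x1 y1` entry. The sketch below reads words and boxes out of an hOCR string using only the Python standard library. It is an illustration of the data format, not code from any of the modules discussed, and the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # box of the word element currently open, if any

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            match = BBOX.search(attr.get("title") or "")
            self._bbox = tuple(int(n) for n in match.groups()) if match else None

    def handle_data(self, data):
        # Only text directly following a word start tag is recorded.
        if self._bbox and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def parse_hocr_words(hocr: str):
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


sample = (
    "<span class='ocr_line' title='bbox 0 0 500 50'>"
    "<span class='ocrx_word' title='bbox 10 20 110 60; x_wconf 95'>student</span> "
    "<span class='ocrx_word' title='bbox 120 20 200 60; x_wconf 90'>life</span>"
    "</span>"
)
words = parse_hocr_words(sample)
```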
As I described previously, I added a field to the media type to store the hOCR. This is in addition to the file field that @alxp 's hocr text overlay work uses. My reason for adding the long text field is 1. because that's what search api can index using a reverse entity reference from the node to the field on the media, and 2. "text_long" because that's what islandora_text_extraction dictates here (thus my suggestion that we make a PR to change that). The "text_long" field would be fine with me if it didn't insist on inserting Having a field type that islandora_mirador defines, my thinking goes, would permit us to not have to design hacky logic to determine what media field contains our hocr, though we could also avoid that by just making the source field name configurable in the islandora_mirador config page, which may be best in any case. So maybe I'm getting talked out of the need for a special field type, at least as long as the wysiwyg hocr editing tool is out of scope. |
I have a not-yet-fully-tested version of a IIIF Search API endpoint working. It generates AnnotationLists when given a node id of the page or paged content node and a search term. Here's my fork of islandora_mirador. There's a lot of setup involved, which I tried to fully document in the README. The primary missing piece that I'm aware of is the part of \Drupal\islandora_iiif\Plugin\views\style\IIIFManifest::render that needs to provide the search block in which our search endpoint will be instantiated. @alxp @ajstanley @dmer @Islandora/committers |
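For reference, the general shape of a IIIF Content Search 1.0 response can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the code in the fork: the function name and URLs are hypothetical, hits are assumed to arrive as `(canvas_uri, text, (x, y, w, h))` tuples, and the `@context` value follows my reading of the simple response form in the specification.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, Tuple[int, int, int, int]]


def build_annotation_list(search_url: str, hits: List[Hit]) -> Dict:
    """Wrap search hits in an sc:AnnotationList, one oa:Annotation per hit."""
    resources = []
    for index, (canvas_uri, text, (x, y, w, h)) in enumerate(hits):
        resources.append({
            # Hypothetical identifier scheme; any unique URI would do.
            "@id": f"{search_url}#hit-{index}",
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "resource": {"@type": "cnt:ContentAsText", "chars": text},
            # The xywh fragment tells the viewer where to draw the highlight.
            "on": f"{canvas_uri}#xywh={x},{y},{w},{h}",
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": search_url,
        "@type": "sc:AnnotationList",
        "resources": resources,
    }


result = build_annotation_list(
    "https://example.org/node/1/search?q=student",
    [("https://example.org/node/2/canvas", "student", (10, 20, 100, 40))],
)
```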
If we're implementing a IIIF Content Search API endpoint, does it really belong as a part of Mirador (or rather, Looking at the comparison, there's a few things to highlight:
I second the idea of pushing these changes into islandora_iiif. The only bit that looks mirador specific is the mirador config form. |
I agree. I think the IIIF Content Search API has uses outside of Mirador and should be implemented separately. |
Hey @patdunlavey and the @Islandora/committers here. I'm raising a red flag 🟥 OSS does not mean copy and paste without attributions. You all know this. The shared work here is heavily "based" (from the e.g patdunlavey@9d9269c is more than heavily based on https://github.com/esmero/strawberryfield/blob/573ffa44a369ad68c59a92b2746258c2671ef13f/src/Controller/StrawberryfieldFlavorDatasourceSearchController.php#L189 And this 2.x...patdunlavey:islandora_mirador:ISSUE-17-inline-search#diff-74bee3a8afb13b6345660264b05398e816ce69bff9b8d3d26a45682f92bb8c44R91-R110 (except for the "mysterious why" comment) is 1:1 to But on your side of things, you should also check on these things, the fact that there are comments in https://github.com/esmero/strawberryfield/blob/1.1.0/strawberryfield.module#L333-L374 replacing mine about this not being understood
Means you (the we in that comment) are not copying the why, and the lack of research (and why I wrote it like that) implies you might be even copying bugs. And I can keep going. What really bothers me here is the idea of "I did the research and found out" instead of being clear of where this comes from. If this was a contribution coming from an individual not representing an institution I would be less concerned and even let a few of these pass, but this is not. Attribution in GPL (we are V3) is not optional and this ethically also affects basically every person involved on our side, @giancarlobi worked on this, @alliomeria worked on this. Community members doing heavy work on testing, reading docs, re-testing, indexing, implementing, refining. It is a community issue of obscuring efforts. Not cool, people. I want to hear your reactions please. |
Thanks @DiegoPino for this post, which I fully agree with, mainly for its defense of what a truly open-source community is. Dear Islandora friends, this is the reason why I abandoned you (I know, not a big loss for you): your current idea of community is far away from the one I met in Arcidosso in 2013. Good luck. |
Morning everybody. I'm just catching up on this. Well I do seem to have stepped in it, big time! In my meager defense, the code I've been working on - which to be clear, I gratefully acknowledge that I used Archipelago code and its underlying research as my starting point - is very much a WIP and at this point, barely a proof of concept for part of the functionality that this issue proposes. I'm nowhere near proposing a PR (the code in question is currently on a task branch of my fork of this islandora_mirador module). It is and has been my intention, if this does result in my creating a PR, to run it by @DiegoPino first to get what I hope will be his blessing and, in any case, ensure that he and others are properly credited. I sincerely apologize for the bad feeling that my premature sharing of this code has generated and I promise to be more mindful going forward. |
Thank you @adam-vessey @seth-shaw-asu and @mjordan for your comments. Skipping for the moment your very helpful code-review comments (though I want to clarify that code review is premature at this point), it seems clear that there is a consensus that it makes most sense to re-target this issue and solution (assuming we find one we are all happy with) to the islandora_iiif module. Is that an accurate reading? @alxp do you agree with that? |
Hi @patdunlavey, I definitely agree that anything primarily relating to IIIF and not specific to Mirador should be in islandora_iiif. That module currently lives in the main Islandora module; it might be helpful to pull it out like we did with islandora_mirador, but it's not necessary. |
Overview of feature request
Mirador 3 includes an internal search function, but requires a query endpoint. Determine the best way to add this to Islandora.
The UT Scarborough Islandora developers have implemented a fork of Islandora Mirador with annotation support, which we should be able to pull from. It is in the annotations branch of the repo here:
https://github.com/digitalutsc/islandora_mirador/tree/annotations
What kind of user is the feature intended for?
(Example user roles: Collections Manager, Developer, Systems Administrator, or User)
End user
What inspired the request?
Ongoing discussion of features needed for paged content.
What existing behavior do you want changed?
Extracted text action may be modified to make associating it with a given media easier.
Any brand new behavior do you want to add to Islandora?
An endpoint that serves search queries in a form compatible with Mirador's inline search.
Any related open or closed issues to this feature request?