Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hibernate-search] Implement indexing and search #6283

Open
wants to merge 46 commits into
base: hibernate-search
Choose a base branch
from

Conversation

matthias-ronge
Copy link
Collaborator

@matthias-ronge matthias-ronge commented Oct 29, 2024

Issue #5760 2c), 2d), 2e) and 2f)

Follow-up pull request to #6218 (immediate diff)

The filters work. Metadata is indexed and searchable. The metadata search syntax works as documented in the wiki.

@BartChris
Copy link
Collaborator

BartChris commented Oct 29, 2024

Great work! Can you elaborate (short high level overview) how the metadata is indexed now in general? (For reference: #4266)

Map<String, String> parameters = FacesContext.getCurrentInstance().getExternalContext()
.getRequestParameterMap();
if (parameters.containsKey("input") && StringUtils.isBlank(filterInEditMode)) {
filterInEditMode = parameters.get("input");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are taking over any argument from the outside without validating that the given value is correct or at least is in range of the expectations?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the user can enter any string into the search slot. The input is not “checked”. But of course, if the user has entered nonsense, the search will not find any results.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this clarification but I expected something different as I read the method, method content and the used written class variable filterInEditMode. Handling of this class variable filterInEditMode is strange in this class but this is a different story.

@matthias-ronge
Copy link
Collaborator Author

Can you elaborate (short high level overview) how the metadata is indexed now in general?

The metadata is stored in a text field that is indexed. The search is later carried out on the text field.

"Pseudo words" are used to search on certain metadata fields, i.e. if the word Berlin is in the TitleDocMain field in the metadata, a pseudo word "titledocmainqberlin" is also indexed. If a user later enters "TitleDocMain:Berlin" in the search field, this is converted into the pseudo word and searched for.

Pseudo words are generated for metadata keys, translated metadata keys according to the rule set, and domains according to the ruleset. The user can therefore also search for "Haupttitel:Berlin" (if the translation in the ruleset is this).

@BartChris
Copy link
Collaborator

BartChris commented Oct 30, 2024

Pseudo words are generated for metadata keys, translated metadata keys according to the rule set, and domains according to the ruleset. The user can therefore also search for "Haupttitel:Berlin" (if the translation in the ruleset is this).

Thanks a lot, this is helpful! Do i understand you correctly that you make the fields searchable by their german and english label as well? So for example "Dokumenttyp:Monograph" as a possible search with the german label?

<key id="docType" use="docType">
           <label>Document Type</label>
           <label lang="de">Dokumenttyp</label>
</key>

And those combined terms (LABEL+VALUE) are all stored in the index? What happens if somebody changes the german label of a metadata key in the ruleset?

Comment on lines 344 to 367
* Creates the keywords for searching in correction messages. This is a
* double-truncated search, i.e. any substring.
*
* @param comments
* the comments of a process
* @return keywords
*/
private static final Set<String> initCommentKeywords(List<Comment> comments) {
Set<String> tokens = new HashSet<>();
for (Comment comment : comments) {
String message = comment.getMessage();
if (StringUtils.isNotBlank(message)) {
for (String splitWord : splitValues(message)) {
String word = normalize(splitWord);
for (int i = 0; i < word.length(); i++) {
for (int j = i + 1; j <= word.length(); j++) {
tokens.add(word.substring(i, j));
}
}
}
}
}
return tokens;
}
Copy link
Collaborator

@BartChris BartChris Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the reasoning here? Do i read this correctly and you are indexing every possible substring of all the words in a comment? Is this really necessary? Normal indexing should make it possible to search by word. Is this not enough?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wiki page requires:

Eine Suche nach Meldung findet alle Vorgänge, die einen Kommentar mit einer Nachricht wie Das ist eine Korrekturmeldung! haben.

To achieve this, we need both-side subword search. I only follow the expected behaviour here. I would prefer a simple-word search, too, but this is what the requirement actually looks like. Maybe we can discuss this, if it is really needed, as it inflates the index. I don’t know a clever way of splitting a German compound-word to its single words, to index just these. (This is possible, but it would be a sophisticated task of computational linguistics, typically involving a dictionary file, too, that I don’t have.)

@matthias-ronge
Copy link
Collaborator Author

Do i understand you correctly that you make the fields searchable by their german and english label as well? So for example "Dokumenttyp:Monograph" as a possible search with the german label?

Yes.

What happens if somebody changes the german label of a metadata key in the ruleset?

The search will only find the process with the old label, until the process is saved again, or all processes are re-indexed. Since the index is now only used for filters, re-indexing (or a missing index) won’t “break” the application any more, as it did in the past, it should be possible to run re-indexing in an environment where people work on without breaking something (but it will still make it slow meanwhile!)

@BartChris
Copy link
Collaborator

BartChris commented Oct 31, 2024

What happens if somebody changes the german label of a metadata key in the ruleset?

The search will only find the process with the old label, until the process is saved again, or all processes are re-indexed. Since the index is now only used for filters, re-indexing (or a missing index) won’t “break” the application any more, as it did in the past, it should be possible to run re-indexing in an environment where people work on without breaking something (but it will still make it slow meanwhile!)

I understand. I fear a situation, where correcting a typo in a label would require a reindex of everything, otherwise old data is only found with the misspelled label. I do not want to increase your workload too much, but would it be the option to allow the label-based-filters in the UI, but only index and search under the common key in the index? By e.g. having a mapping function, which maps the labels to the actual key and searches for that.

@BartChris
Copy link
Collaborator

BartChris commented Oct 31, 2024

Since the index is now only used for filters, re-indexing (or a missing index) won’t “break” the application any more, as it did in the past, it should be possible to run re-indexing in an environment where people work on without breaking something (but it will still make it slow meanwhile!)

A more general remark here. I have not really tested anything so far, but it appears to me that almost everything gets indexed. My assumption would be that functionality like displaying list of processes (with all the necessary icons, which require lookups) is faster from the index since we can store data there in denormalized form. It is nice, that we can run Kitodo without a running index for certain functionality but my intuition would be to always use the index if that gives us an edge over database-based retrieval. Maybe i am overlooking something, but it might be worth considering to rely more on the index to have better performance when displaying 100 processes at once, which is very slow in Kitodo 3.7. (which probably uses a mixture of index and database-based retrieval)

This is not necessarily adressed at the current PR, but more a general remark which can be considered.

@matthias-ronge
Copy link
Collaborator Author

I fear a situation, where correcting a typo in a label would require a reindex of everything, otherwise old data is only found with the misspelled label.

In my experience, rulesets are rarely changed at all during a running project, except perhaps the addition of a field; and I know librarians are very good at not making typos.

would it be the option to allow the label-based-filters in the UI, but only index and search under the common key in the index? By e.g. having a mapping function, which maps the labels to the actual key and searches for that.

  1. The metadata is also indexed under the key string defined in the rule set, so you can always use the key name for the search.
  2. Yes, of course, that would be possible, but I think it goes against the purpose of using search engines. It would be possible to implement this, but it would significantly slow down search queries because all relevant rulesets would have to be determined each time and checked to see whether a field name needs to be translated. I don't think that would be useful.

Perhaps it would be interesting to discuss this case among users. Such a change can be introduced later.

@matthias-ronge matthias-ronge marked this pull request as ready for review October 31, 2024 13:53
@matthias-ronge
Copy link
Collaborator Author

it appears to me that almost everything gets indexed.

Yes, a lot of indexing is still going on at the moment and I don't think that's necessary. But the task here was not to make more than necessary changes, but to keep the application as most as it was before. But I think that in a separate development, it will be possible to clean this up so that in the end only processes and tasks are indexed, as only these contain metadata. The other objects are never searched, so they don’t do something sensible in the index by now.

My assumption would be that functionality like displaying list of processes […] is faster from the index since we can store data there in denormalized form. […] Maybe i am overlooking something, but it might be worth considering to rely more on the index to have better performance when displaying 100 processes at once, which is very slow in Kitodo 3.7. (which probably uses a mixture of index and database-based retrieval)

No, it's exactly the other way round: In Production 3.7, the display is taken from the index alone, as you describe here as a wish. And it is slow. In this version, this behavior was exactly inverted and now everything except the keyword search is taken from the database. On my developer laptop, the application has become significantly faster as a result.

Don't ask me why it behaves like this. This is another example of how performance is less about well-intentioned programming than about actual measurements. I suspect that both database queries and restoring Java in-memory objects from database rows are so much more sophisticated, having been researched for several decades longer, that it beats out indexing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants