[hibernate-search] Implement indexing and search #6283

matthias-ronge · 2024-10-29T16:48:00Z

Issue #5760 2c), 2d), 2e) and 2f)

Follow-up pull request to #6218 (immediate diff)

The filters work. Metadata is indexed and searchable. The metadata search syntax works as documented in the wiki.

BartChris · 2024-10-29T18:04:19Z

Great work! Can you elaborate (short high level overview) how the metadata is indexed now in general? (For reference: #4266)

henning-gerhardt · 2024-10-30T07:55:50Z

Kitodo/src/main/java/org/kitodo/production/filters/FilterMenu.java

+        Map<String, String> parameters = FacesContext.getCurrentInstance().getExternalContext()
+                .getRequestParameterMap();
+        if (parameters.containsKey("input") && StringUtils.isBlank(filterInEditMode)) {
+            filterInEditMode = parameters.get("input");


You are accepting any argument from the outside without validating that the given value is correct, or at least within the expected range?

Yes, the user can enter any string into the search slot. The input is not “checked”. But of course, if the user has entered nonsense, the search will not find any results.

Thank you for the clarification, but I expected something different after reading the method, its body, and the class variable filterInEditMode that it writes to. The handling of filterInEditMode in this class is strange, but that is a different story.

matthias-ronge · 2024-10-30T09:24:35Z

Can you elaborate (short high level overview) how the metadata is indexed now in general?

The metadata is stored in a text field that is indexed. The search is later carried out on that text field.

"Pseudo words" are used to search on specific metadata fields. For example, if the word Berlin is in the TitleDocMain field in the metadata, a pseudo word "titledocmainqberlin" is also indexed. If a user later enters "TitleDocMain:Berlin" in the search field, this is converted into the pseudo word and searched for.

Pseudo words are generated for metadata keys, translated metadata keys according to the ruleset, and domains according to the ruleset. The user can therefore also search for "Haupttitel:Berlin" (if that is the translation in the ruleset).
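
The conversion described above can be sketched as follows. This is a minimal illustration, not the code from this pull request: the class and method names are hypothetical, the separator letter "q" is inferred from the example "titledocmainqberlin", and plain lower-casing stands in for whatever normalization the real implementation applies.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not the actual Kitodo implementation. */
final class PseudoWordSketch {

    /** Separator between field name and value, inferred from "titledocmainqberlin". */
    private static final char SEPARATOR = 'q';

    /** Builds the pseudo word that would be indexed for a field/value pair. */
    static String pseudoWord(String field, String value) {
        // Assumption: normalization is plain lower-casing.
        return field.toLowerCase(Locale.ROOT) + SEPARATOR + value.toLowerCase(Locale.ROOT);
    }

    /** Converts a query token such as "TitleDocMain:Berlin"; plain words are only lower-cased. */
    static String toSearchToken(String queryToken) {
        int colon = queryToken.indexOf(':');
        if (colon <= 0 || colon == queryToken.length() - 1) {
            // No field prefix, or nothing after the colon: treat as an ordinary word.
            return queryToken.toLowerCase(Locale.ROOT);
        }
        return pseudoWord(queryToken.substring(0, colon), queryToken.substring(colon + 1));
    }

    public static void main(String[] args) {
        System.out.println(toSearchToken("TitleDocMain:Berlin")); // titledocmainqberlin
        System.out.println(toSearchToken("Haupttitel:Berlin"));   // haupttitelqberlin
    }
}
```

Because the same function is applied at indexing time and at query time, the search engine only ever compares ordinary tokens and needs no knowledge of fields.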

BartChris · 2024-10-30T16:35:15Z

Pseudo words are generated for metadata keys, translated metadata keys according to the ruleset, and domains according to the ruleset. The user can therefore also search for "Haupttitel:Berlin" (if that is the translation in the ruleset).

Thanks a lot, this is helpful! Do I understand you correctly that you make the fields searchable by their German and English labels as well? So, for example, "Dokumenttyp:Monograph" would be a possible search using the German label?

<key id="docType" use="docType">
           <label>Document Type</label>
           <label lang="de">Dokumenttyp</label>
</key>

And those combined terms (LABEL+VALUE) are all stored in the index? What happens if somebody changes the German label of a metadata key in the ruleset?

BartChris · 2024-10-30T17:00:43Z

Kitodo-DataManagement/src/main/java/org/kitodo/data/database/beans/Keyworder.java

+     * Creates the keywords for searching in correction messages. This is a
+     * double-truncated search, i.e. any substring.
+     * 
+     * @param comments
+     *            the comments of a process
+     * @return keywords
+     */
+    private static final Set<String> initCommentKeywords(List<Comment> comments) {
+        Set<String> tokens = new HashSet<>();
+        for (Comment comment : comments) {
+            String message = comment.getMessage();
+            if (StringUtils.isNotBlank(message)) {
+                for (String splitWord : splitValues(message)) {
+                    String word = normalize(splitWord);
+                    for (int i = 0; i < word.length(); i++) {
+                        for (int j = i + 1; j <= word.length(); j++) {
+                            tokens.add(word.substring(i, j));
+                        }
+                    }
+                }
+            }
+        }
+        return tokens;
+    }


Can you explain the reasoning here? Do I read this correctly that you are indexing every possible substring of every word in a comment? Is this really necessary? Normal indexing should make it possible to search by word. Is that not enough?

The wiki page requires:

A search for Meldung finds all processes that have a comment with a message such as Das ist eine Korrekturmeldung! ("This is a correction message!").

To achieve this, we need a search that is truncated on both sides, i.e. a substring search. I am only following the expected behaviour here. I would prefer a simple word search too, but this is what the requirement actually says. Maybe we can discuss whether it is really needed, as it inflates the index. I don’t know a clever way of splitting a German compound word into its component words so that only those are indexed. (This is possible, but it would be a sophisticated computational-linguistics task, typically requiring a dictionary file that I don’t have.)
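
To make the index inflation concrete, here is a small sketch that repeats the two inner loops of initCommentKeywords on a single, already normalized word. The class and method names are hypothetical. A word of length n yields at most n * (n + 1) / 2 tokens; the actual number can be lower because repeated substrings collapse in the set.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of the double-truncated tokenization discussed above. */
final class SubstringTokenSketch {

    /** Returns every non-empty substring of the word, like the inner loops of initCommentKeywords. */
    static Set<String> allSubstrings(String word) {
        Set<String> tokens = new HashSet<>();
        for (int i = 0; i < word.length(); i++) {
            for (int j = i + 1; j <= word.length(); j++) {
                tokens.add(word.substring(i, j));
            }
        }
        return tokens;
    }

    /** Upper bound on the number of tokens for a word of length n. */
    static int maxTokens(int n) {
        return n * (n + 1) / 2;
    }

    public static void main(String[] args) {
        String word = "korrekturmeldung"; // 16 characters
        Set<String> tokens = allSubstrings(word);
        // The compound contains "meldung", so a search for it will match.
        System.out.println("contains 'meldung': " + tokens.contains("meldung"));
        System.out.println(tokens.size() + " distinct tokens, upper bound " + maxTokens(word.length()));
    }
}
```

For the 16-letter word Korrekturmeldung the upper bound is 136 tokens for one word of one comment, which is the cost being weighed against the wiki requirement.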

matthias-ronge · 2024-10-31T09:54:17Z

Do I understand you correctly that you make the fields searchable by their German and English labels as well? So, for example, "Dokumenttyp:Monograph" would be a possible search using the German label?

Yes.

What happens if somebody changes the German label of a metadata key in the ruleset?

The search will only find the process under the old label until the process is saved again or all processes are re-indexed. Since the index is now only used for filters, re-indexing (or a missing index) won’t “break” the application any more, as it did in the past. It should therefore be possible to run re-indexing in an environment where people are working without breaking anything (but it will still slow things down in the meantime!)

BartChris · 2024-10-31T10:32:15Z

What happens if somebody changes the German label of a metadata key in the ruleset?

The search will only find the process under the old label until the process is saved again or all processes are re-indexed. Since the index is now only used for filters, re-indexing (or a missing index) won’t “break” the application any more, as it did in the past. It should therefore be possible to run re-indexing in an environment where people are working without breaking anything (but it will still slow things down in the meantime!)

I understand. I fear a situation where correcting a typo in a label would require a reindex of everything; otherwise old data is only found with the misspelled label. I do not want to increase your workload too much, but would it be an option to allow the label-based filters in the UI, but only index and search under the common key in the index? For example, by having a mapping function that maps the labels to the actual key and searches for that.
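
A mapping function of this kind could look roughly like the sketch below. It is only an illustration of the proposal, with hypothetical class and method names; how the labels would actually be collected from the rulesets is left open, and the example labels are taken from the docType key quoted earlier in this conversation.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical query-time mapping from translated labels to the ruleset key. */
final class LabelKeyMappingSketch {

    /** Lower-cased label or key name mapped to the canonical key. */
    private final Map<String, String> labelToKey = new HashMap<>();

    /** Registers a key together with its labels (assumed to come from a ruleset). */
    void register(String key, String... labels) {
        labelToKey.put(key.toLowerCase(Locale.ROOT), key);
        for (String label : labels) {
            labelToKey.put(label.toLowerCase(Locale.ROOT), key);
        }
    }

    /** Rewrites "Label:value" to "key:value"; unknown fields and plain words pass through. */
    String resolve(String queryToken) {
        int colon = queryToken.indexOf(':');
        if (colon <= 0) {
            return queryToken;
        }
        String field = queryToken.substring(0, colon);
        String key = labelToKey.getOrDefault(field.toLowerCase(Locale.ROOT), field);
        return key + queryToken.substring(colon);
    }

    public static void main(String[] args) {
        LabelKeyMappingSketch mapping = new LabelKeyMappingSketch();
        mapping.register("docType", "Document Type", "Dokumenttyp");
        System.out.println(mapping.resolve("Dokumenttyp:Monograph")); // docType:Monograph
    }
}
```

With such a mapping only the key would need to be indexed, so a corrected label takes effect as soon as the mapping is rebuilt. The trade-off is that the mapping has to be derived from the relevant rulesets before each query or cached.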

BartChris · 2024-10-31T11:15:54Z

Since the index is now only used for filters, re-indexing (or a missing index) won’t “break” the application any more, as it did in the past. It should therefore be possible to run re-indexing in an environment where people are working without breaking anything (but it will still slow things down in the meantime!)

A more general remark here. I have not really tested anything so far, but it appears to me that almost everything gets indexed. My assumption would be that functionality like displaying the list of processes (with all the necessary icons, which require lookups) is faster from the index, since we can store data there in denormalized form. It is nice that we can run Kitodo without a running index for certain functionality, but my intuition would be to always use the index if that gives us an edge over database-based retrieval. Maybe I am overlooking something, but it might be worth considering relying more on the index to get better performance when displaying 100 processes at once, which is very slow in Kitodo 3.7 (which probably uses a mixture of index and database-based retrieval).

This is not necessarily addressed at the current PR, but is more of a general remark that can be considered.

matthias-ronge · 2024-10-31T13:53:48Z

I fear a situation where correcting a typo in a label would require a reindex of everything; otherwise old data is only found with the misspelled label.

In my experience, rulesets are rarely changed at all during a running project, except perhaps the addition of a field; and I know librarians are very good at not making typos.

would it be an option to allow the label-based filters in the UI, but only index and search under the common key in the index? For example, by having a mapping function that maps the labels to the actual key and searches for that.

The metadata is also indexed under the key string defined in the ruleset, so you can always use the key name for the search.
Yes, of course, such a mapping would be possible to implement, but I think it goes against the purpose of using a search engine. It would significantly slow down search queries, because all relevant rulesets would have to be determined each time and checked to see whether a field name needs to be translated. I don't think that would be useful.

Perhaps it would be interesting to discuss this case among users. Such a change can be introduced later.

matthias-ronge · 2024-10-31T14:13:01Z

it appears to me that almost everything gets indexed.

Yes, a lot of indexing is still going on at the moment, and I don't think that's necessary. But the task here was to make no more changes than necessary and to keep the application as close as possible to how it was before. I think that in a separate development it will be possible to clean this up so that in the end only processes and tasks are indexed, as only these contain metadata. The other objects are never searched, so they currently serve no purpose in the index.

My assumption would be that functionality like displaying the list of processes […] is faster from the index, since we can store data there in denormalized form. […] Maybe I am overlooking something, but it might be worth considering relying more on the index to get better performance when displaying 100 processes at once, which is very slow in Kitodo 3.7 (which probably uses a mixture of index and database-based retrieval).

No, it's exactly the other way round: In Production 3.7, the display is taken from the index alone, as you describe here as a wish. And it is slow. In this version, this behavior was exactly inverted and now everything except the keyword search is taken from the database. On my developer laptop, the application has become significantly faster as a result.

Don't ask me why it behaves like this. This is another example of how performance is less about well-intentioned programming than about actual measurements. I suspect that both database queries and restoring Java in-memory objects from database rows are so much more sophisticated, having been researched for several decades longer, that they outperform retrieval from the index.

solth and others added 30 commits September 13, 2024 11:37

Add HibernateSearch dependencies to DataManagement module

f94697a

Declare Hibernate Search version in root POM, and bump to 6.2.4.Final

43dbc44

Add HibernateSearch annotations to base indexed classes

10ebf2f

Add index names, add 'Indexed' annotation

adcd4a1

Add annotations for complex fields

d1a5c97

Add annotations for Docket, Filter, Ruleset and Workflow

e8c841a

Show the indexing page if the search server is available

6c28a24

Remove Create Mapping and Delete Index buttons (henceforth implied)

40e5249

Remove button to index remaining - not supported by Hibernate Search

e97eba0

Index all objects of given 'objectType' with massIndexer

26280cf

Re-implement indexing page

faeed43

Fix checkstyle

40ef313

Improve wording, add Javadoc

11bcf85

Don't show a total of 0 objects when starting indexing

caf4bae

Returns result processing to the calling class

5b2e0ef

Remove test for mapping - created transparently

d6ae013

Set number of database objects

24a2005

Add template count

5fa7f85

Bring OpenSearch background instance for tests

732997c

Add MockDatabase index to Kitodo - DataManagement

ea5f0e3

Fix test

6c0eecc

Increase timeout (slow laptop)

5fa2d88

Fix problems

60bcacf

Fix search for ID

d97ac6a

Log all queries

9c20c89

Add missing file for tests

0c8374c

Fix tests

afd9ae8

Add Hibernate Search config file to Selenium resources

3369fa3

Remove unused imports

963e1c7

Add tasks to processes to enable sorting by sortHelperStatus

c1bbea7

matthias-ronge added 5 commits October 28, 2024 16:03

Implement search

3cf372b

Add search fields to Process and Task beans

d4c1e24

Fix indexing issues

6c76265

Get search to work, use IndexingService, some clean-up

6c02e98

Fix checkstyle (Kitodo - Data Management)

621dc51

matthias-ronge added 2 commits October 30, 2024 07:46

Fix checkstyle (Kitodo - Core)

0de38bf

Re-enable the global search slot

b4ab2fa

henning-gerhardt reviewed Oct 30, 2024

Fix some more things

8288ba7

Reduce log output

b336639

matthias-ronge force-pushed the 5760_2c branch from c378038 to c052fc4 Compare October 30, 2024 11:12

Do not crash tests with orphaned task objects

5bee9cb

matthias-ronge force-pushed the 5760_2c branch from c052fc4 to 5bee9cb Compare October 30, 2024 11:30

BartChris reviewed Oct 30, 2024

matthias-ronge added 3 commits October 31, 2024 10:36

Betterr logging

d4b7b5e

Fix tasks indexing

0649f6b

Improve log files

38dae76

Fix overgeneration of translated keys

fe389e9

matthias-ronge marked this pull request as ready for review October 31, 2024 13:53

matthias-ronge added 2 commits November 14, 2024 14:45

Remove what isn't absolutely necessary

1f6ac42

Prevent loading tasks in wrong Hibernate session for indexing

a21e0f3
