[hibernate-search] Introduce Hibernate Search framework and implement indexing page #6218

Open
wants to merge 30 commits into
base: hibernate-search

matthias-ronge
Collaborator

@matthias-ronge matthias-ronge commented Sep 4, 2024

Issue #5760 2a) and 2b)

Follow-up pull request to #6209 (immediate diff)

Recording

The three numbers before the slash in “Indexed entries” represent the number of objects that Hibernate has already loaded from the database, the number of objects that have been prepared as indexable documents (JSONs), and finally the number of indexed documents.

Basic experience: Hibernate Search and lazy loading don't mix, and it looks like we have to accept that. As a result, I have deactivated lazy loading wherever a set typically has few members (< 25). This affects most sets, e.g. projects of a template, tasks, users or properties of a template or a process, etc. Where a set can typically be large (> 1000), its elements are not indexed. Example: the processes of a batch. Consideration: if an object has a very large number of indexed sub-elements, it becomes increasingly likely that the object is found by any search query, so a hit says less and less. Such indexing also makes the index enormously large. I therefore consider it justifiable not to index these fields.
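
The findability consideration can be made concrete with a little arithmetic. This is a rough sketch under simplifying assumptions that are not taken from this PR: each indexed sub-element contributes one term drawn uniformly at random from a vocabulary of V distinct terms, and a query consists of a single random term. The chance that an object with n sub-elements matches is then 1 - (1 - 1/V)^n. The class name, the vocabulary size and the sample values of n are hypothetical.

```java
// Illustrative model only: the vocabulary size and the uniform term
// distribution are assumptions, not measurements from a Kitodo index.
class Findability {
    /** Probability that one random query term matches an object embedding n random terms. */
    static double matchProbability(int subElements, int vocabularySize) {
        return 1.0 - Math.pow(1.0 - 1.0 / vocabularySize, subElements);
    }

    public static void main(String[] args) {
        int vocabulary = 10_000; // hypothetical number of distinct terms
        // 25 and 1000 are the set sizes named above; 50000 is an arbitrary large set
        for (int n : new int[] {25, 1_000, 50_000}) {
            System.out.printf("n=%d -> %.4f%n", n, matchProbability(n, vocabulary));
        }
    }
}
```

In this model a small set barely affects how often the object is hit, while a very large set makes the object match almost every query.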

@matthias-ronge matthias-ronge changed the base branch from master to hibernate-search September 4, 2024 14:59
@henning-gerhardt
Collaborator

@matthias-ronge: a hopefully short, general question: is it possible to use different indices with Hibernate Search? Currently this is possible by using different values for the elasticsearch.index configuration. Is this, or something similar, still possible? I'm asking because I work with different Kitodo.Production versions, each with its own metadata directory on my local file system, its own database in a MariaDB instance and its own search prefix in an Elasticsearch instance. This doesn't have to work in the current state of the changes, nor is it a current goal, but maybe it is something for later?

@matthias-ronge
Collaborator Author

is it possible to use different indices with Hibernate-Search?

The index names for the individual objects are contained in the annotations as a string. I cannot estimate whether it is even possible to use variables here, or whether these have to be hard-coded strings at compile time; but I suspect the latter. Index access is controlled via properties such as port. You could install several index services on different ports and set the port at runtime before the program starts, or change the index data directory (as a symbolic link).

Such a feature is currently not in the scope of our development.
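
Setting the connection properties before the program starts could look roughly like the following. This is a minimal sketch using only `java.util.Properties`: the keys `hibernate.search.backend.hosts` and `hibernate.search.backend.protocol` are taken from the `hibernate.properties` file in this PR, while the class name, the `withOverrides` helper and the use of JVM system properties as the override source are assumptions for illustration, not code from this branch.

```java
import java.util.Properties;

// Hypothetical helper, not part of Kitodo or Hibernate Search.
class SearchBackendConfig {
    /** Returns a copy of base in which every key also set in overrides takes the override value. */
    static Properties withOverrides(Properties base, Properties overrides) {
        Properties merged = new Properties();
        merged.putAll(base);
        for (String key : base.stringPropertyNames()) {
            String value = overrides.getProperty(key);
            if (value != null) {
                merged.setProperty(key, value);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("hibernate.search.backend.hosts", "localhost:9205");
        defaults.setProperty("hibernate.search.backend.protocol", "http");
        // e.g. started with -Dhibernate.search.backend.hosts=localhost:9206
        Properties effective = withOverrides(defaults, System.getProperties());
        System.out.println(effective.getProperty("hibernate.search.backend.hosts"));
    }
}
```

The merged properties would still have to be handed to Hibernate when the session factory is built; how that would be wired into Kitodo is not shown here.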

@henning-gerhardt
Collaborator

henning-gerhardt commented Sep 9, 2024

Thank you @matthias-ronge for the explanation. I know, and I did not expect that using different Hibernate Search indices would be part of the current development.

Edit: Maybe indexlayout-strategy-custom is a way to achieve this. But this is nothing for now.

Kitodo-DataManagement/hibernate.properties
Collaborator

Isn't an @Indexed(index = "kitodo-folder") annotation missing here, as in the other bean files?

Comment on lines +1 to +3
hibernate.search.enabled=true
hibernate.search.backend.hosts=localhost:9205
hibernate.search.backend.protocol=http
Collaborator

See my comment above on the first hibernate.properties file.

@@ -37,6 +37,9 @@
<property name="hibernate.connection.verifyServerCertificate">false</property>
<property name="hibernate.connection.useSSL">false</property>

<!-- Hibernate search -->
Collaborator

Are the other Hibernate Search parameters, such as the URI, port, etc., which were added in the hibernate.properties file, missing here?

Collaborator Author

I can confirm that I also notice the similarity. Needs testing.

Collaborator

It would be nice if the Hibernate Search properties were stored in one place / file. If that is not possible, it would be bad, at least for me.

Comment on lines +1 to +3
hibernate.search.enabled=true
hibernate.search.backend.hosts=localhost:9205
hibernate.search.backend.protocol=http
Collaborator

See my comment on the first hibernate.properties file.

Comment on lines +1 to +3
hibernate.search.enabled=true
hibernate.search.backend.hosts=localhost:9205
hibernate.search.backend.protocol=http
Collaborator

See my comment on the first hibernate.properties file.

@thomaslow
Collaborator

@matthias-ronge I checked out your branch, and took notes of my testing experience.

  1. I built a new war file and deployed it to Tomcat. At first, there was an error message in the log stating that Hibernate Search could not connect to my Elasticsearch instance (which is fine, because I do not run it on localhost).

Unable to detect the Elasticsearch version running on the cluster: HSEARCH400007: Elasticsearch request failed: Connection refused

  2. Then I copied the hibernate.properties into my config-local directory and changed the host name. Now the application starts without any error messages.

  3. I tried to log in to kitodo-production with my admin account. Nothing happens. The page keeps loading forever. No errors, but lots of CPU activity for mariadb and tomcat. Maybe indexes are being created in the background? But there is no user interface or message.

  4. After ~15 minutes, the kitodo dashboard is shown, but my CPU is still active. The System - Indexing page does not show any progress. Only 0% everywhere. After ~30 minutes without any page loading, the CPU load is normal again. Maybe disabling lazy-loading triggers thousands of database queries when loading the dashboard? My test database contains ~80,000 processes.

Unfortunately, at this state, it is not possible to do further testing.

@matthias-ronge In case you have not done this yet, please test your branch with a large amount of test data. Otherwise, let me know, and I will try to figure out why pages are loading so slowly on my machine.

@thomaslow
Collaborator

thomaslow commented Oct 21, 2024

I tried to start the indexing. Some entities were indexed within a few seconds. The remaining entities (processes, projects, tasks, templates) have stayed at 0% for at least 5 minutes.

image

After ~10 minutes, all entities except processes and tasks were indexed at 100%. Processes and tasks have only 60 indexed entities each (of 80,000 and 4,000 respectively).

@matthias-ronge
Collaborator Author

Thank you for testing and for your insights. This is not what I expected. I have not tested with such a large amount of data yet, so I will have to look into it myself first. My general assumption is that the framework works reasonably well and that this could be due to some small thing. Once I can confirm that it works for large data, I will let you know.

The code does not manually create an index at startup. I also saw a delay at first, but only of a few seconds. It is clear that I have to check this.

@BartChris
Collaborator

BartChris commented Nov 4, 2024

4. Maybe disabling lazy-loading triggers thousands of database queries when loading the dashboard? My test database contains ~80,000 processes.

After checking out the branch, I logged the SQL statements. Just scrolling through the list of processes (10 per page) floods my database with queries. I have around 1000 processes in my database, and it takes very long to jump to the next 10 entries.

Hundreds of requests are made for one page:

2024-11-04T09:01:04.540144Z	   23 Query	rollback
2024-11-04T09:01:04.540190Z	   23 Query	SET autocommit=1
2024-11-04T09:01:04.540239Z	   22 Query	SET autocommit=0
2024-11-04T09:01:04.540308Z	   22 Query	select batches0_.process_id as process_2_2_0_, batches0_.batch_id as batch_id1_2_0_, batch1_.id as id1_1_1_, batch1_.title as title2_1_1_, batch1_.type as type3_1_1_ from batch_x_process batches0_ inner join batch batch1_ on batches0_.batch_id=batch1_.id where batches0_.process_id=2310
2024-11-04T09:01:04.540415Z	   22 Query	rollback
2024-11-04T09:01:04.540450Z	   22 Query	SET autocommit=1
2024-11-04T09:01:04.540493Z	   23 Query	SET autocommit=0
2024-11-04T09:01:04.540573Z	   23 Query	select workpieces0_.process_id as process_1_36_0_, workpieces0_.property_id as property2_36_0_, property1_.id as id1_22_1_, property1_.choice as choice2_22_1_, property1_.creationDate as creation3_22_1_, property1_.dataType as datatype4_22_1_, property1_.obligatory as obligato5_22_1_, property1_.title as title6_22_1_, property1_.value as value7_22_1_ from workpiece_x_property workpieces0_ inner join property property1_ on workpieces0_.property_id=property1_.id where workpieces0_.process_id=2309
2024-11-04T09:01:04.540918Z	   23 Query	rollback
2024-11-04T09:01:04.540958Z	   23 Query	SET autocommit=1
2024-11-04T09:01:04.541003Z	   22 Query	SET autocommit=0
2024-11-04T09:01:04.541085Z	   22 Query	select templates0_.process_id as process_1_30_0_, templates0_.property_id as property2_30_0_, property1_.id as id1_22_1_, property1_.choice as choice2_22_1_, property1_.creationDate as creation3_22_1_, property1_.dataType as datatype4_22_1_, property1_.obligatory as obligato5_22_1_, property1_.title as title6_22_1_, property1_.value as value7_22_1_ from template_x_property templates0_ inner join property property1_ on templates0_.property_id=property1_.id where templates0_.process_id=2309
2024-11-04T09:01:04.541188Z	   22 Query	rollback
2024-11-04T09:01:04.541221Z	   22 Query	SET autocommit=1
2024-11-04T09:01:04.541262Z	   23 Query	SET autocommit=0

From time to time, in addition to the many smaller queries, really complex queries are fired.

image
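
The pattern in the log, one collection query per process and per eagerly loaded association, resembles the classic N+1 select problem. The following back-of-the-envelope sketch shows how quickly that adds up. The page size of 10 and the three associations (batches, workpieces, templates) come from the log excerpt above; the fan-out per association, the cascade depth, the assumption that loaded entities have the same number of eager associations, and all names in the code are hypothetical and not measured on this branch.

```java
// Hypothetical cost model, not code from this branch.
class EagerLoadCost {
    /**
     * Counts the follow-up SELECTs when each of `entities` loads `associations`
     * collections with one query apiece, each query returns `fanOut` further
     * entities, and the cascade continues for `depth` levels.
     */
    static long followUpQueries(long entities, int associations, int fanOut, int depth) {
        if (depth == 0 || entities == 0) {
            return 0;
        }
        long own = entities * associations;   // one SELECT per entity and association
        long loaded = own * fanOut;           // entities pulled in by those SELECTs
        return own + followUpQueries(loaded, associations, fanOut, depth - 1);
    }

    /** One SELECT for the page itself plus all eager follow-up SELECTs. */
    static long queriesPerPage(int pageSize, int associations, int fanOut, int depth) {
        return 1 + followUpQueries(pageSize, associations, fanOut, depth);
    }

    public static void main(String[] args) {
        // 10 processes per page and 3 associations as in the log; fan-out 2 is hypothetical
        System.out.println(queriesPerPage(10, 3, 2, 1));
        System.out.println(queriesPerPage(10, 3, 2, 2));
    }
}
```

Even with these small numbers, a second level of eager loading pushes a single page into the hundreds of queries, which is in line with what the log shows.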

@matthias-ronge
Collaborator Author

I can reproduce the error: for me it doesn't start with a larger database (8000 processes) either; or rather, it takes a long time and I'm still waiting. I don't know why that is. It must be coming from the framework, since none of the code being executed is code that I wrote. I don't think it's good that it takes so long to start.

6 participants