Investigate the performance issue in DataONE indexer #34

Open
taojing2002 opened this issue Dec 7, 2022 · 3 comments

@taojing2002
Collaborator

I deployed a DataONE Indexer instance on the dev cluster and installed a Metacat instance that supports RabbitMQ on test.arcticdata.io. I created a simple package containing a single metadata object and a single data object. Indexing took more than 14 seconds to finish, and the annotation processor alone accounted for about eight seconds of that.

Matt suggested that we compare the performance of the DataONE indexer with that of the current Metacat indexer. We can also test it on the production cluster.

@taojing2002 taojing2002 added this to the 3.0.0 milestone Dec 7, 2022
@taojing2002 taojing2002 self-assigned this Dec 7, 2022
@taojing2002
Collaborator Author

The initialize method in the OntologyModelService class takes a long time to read the ontologies from disk into an in-memory Jena model. We moved the call to initialize into the index worker's startup process, so the cost is paid once at startup rather than while objects are being indexed. This improved performance during object indexing.
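
As an illustration of the change described above, here is a minimal sketch of loading an expensive resource once at worker startup and reusing it for every message. It is not the project's actual code: `ExpensiveModel`, `WorkerStartup`, `warmUp`, and `handleMessage` are hypothetical names, and the constructor only counts loads where the real service would read ontologies from disk.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive resource such as the ontology model.
final class ExpensiveModel {
    // Counts how many times the expensive load actually ran.
    static final AtomicInteger LOADS = new AtomicInteger();

    private ExpensiveModel() {
        // Assumption: the real code would read ontologies from disk here.
        LOADS.incrementAndGet();
    }

    // Initialization-on-demand holder: the JVM initializes this class once,
    // in a thread-safe way, the first time get() is called.
    private static final class Holder {
        static final ExpensiveModel INSTANCE = new ExpensiveModel();
    }

    static ExpensiveModel get() {
        return Holder.INSTANCE;
    }
}

// Hypothetical worker lifecycle hooks.
final class WorkerStartup {
    // Called once when the worker starts, before it consumes any message.
    static void warmUp() {
        ExpensiveModel.get();
    }

    // Per-object work reuses the already loaded model instead of reloading it.
    static ExpensiveModel handleMessage() {
        return ExpensiveModel.get();
    }
}
```

With this shape, the first message handled after `warmUp()` does not pay the load cost, which is the effect reported in the comment above.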

@taojing2002
Collaborator Author

Two issues remain:

  1. Iterating over the SPARQL query results in the OntologyModelService takes a long time (about four seconds). For details, see this ticket: jena.query.ResultSet.hasNext takes a long time in OntologyModelService.expandConcepts #43
  2. Sending the processed Solr document to the Solr server and receiving the response takes a long time (1.5 seconds). With my local stand-alone Java dataone-indexer, the same step takes about 0.1 seconds.
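
Numbers like the ones above can be collected with a small per-stage timer. The following is a minimal sketch under stated assumptions: `StageTimer` and the stage names are hypothetical and not part of dataone-indexer, and the lambdas stand in for the real query-iteration and Solr-update calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper that accumulates elapsed time per named stage.
final class StageTimer {
    // Insertion-ordered so the report follows the order stages first ran.
    private final Map<String, Long> nanos = new LinkedHashMap<>();

    // Runs the work, records its elapsed time under the stage name,
    // and returns the work's result unchanged.
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanos.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    // Total nanoseconds per stage, summed over all calls.
    Map<String, Long> totals() {
        return nanos;
    }
}
```

Wrapping the result iteration and the Solr update in separate `time(...)` calls on the cluster and on a local stand-alone run would give directly comparable per-stage totals.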

@artntek artntek modified the milestones: 3.0.0, 3.1.0 Feb 6, 2024
@artntek
Collaborator

artntek commented Feb 6, 2024

From #43, "jena.query.ResultSet.hasNext takes a long time in OntologyModelService.expandConcepts" (duplicate, now closed):

On the dev cluster, the jena.query.ResultSet.hasNext method takes about four seconds to finish. However, when the same document is inserted a second time, the call takes almost no time, so some form of caching appears to be involved. The code looks like this:

        Query query = QueryFactory.create(q);
        QueryExecution qexec = QueryExecutionFactory.create(query, ontModel);
        ResultSet results = qexec.execSelect();
        String name = field.getName();
        Set<String> values = new HashSet<String>();
        // results.hasNext() takes a long time
        while (results.hasNext()) {
          QuerySolution solution = results.next();
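
Since a repeated insert of the same document is fast, one option to explore is caching expansion results explicitly, so the slow query runs at most once per concept. The sketch below is an assumption-laden illustration, not the project's implementation: `ExpansionCache` and `slowLookup` are hypothetical, and `slowLookup` stands in for the SPARQL query and result iteration shown above. It only helps if the expansion for a given concept URI does not change while the worker is running.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical memoizing wrapper around a slow concept-expansion lookup.
final class ExpansionCache {
    private final Map<String, Set<String>> cache = new ConcurrentHashMap<>();
    // Assumption: wraps the SPARQL query and the iteration over its results.
    private final Function<String, Set<String>> slowLookup;

    ExpansionCache(Function<String, Set<String>> slowLookup) {
        this.slowLookup = slowLookup;
    }

    // Runs the slow lookup only for concept URIs not seen before;
    // later calls with the same URI return the stored set.
    Set<String> expand(String conceptUri) {
        return cache.computeIfAbsent(conceptUri, slowLookup);
    }
}
```

A cache like this would not shorten the first expansion of a concept, so it complements rather than replaces finding out why the first `hasNext()` call is slow.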
