Indexing optimization #741

Closed
hectorcorrea opened this issue Jan 14, 2025 · 0 comments · Fixed by #752

hectorcorrea commented Jan 14, 2025

We are currently indexing the file list for each record twice: once as part of pdc_describe_json_ss and again as files_ss. This is not a problem for small datasets, but for datasets with 60K files it is rather inefficient.

There is really no need to index the second field (files_ss) since pdc_describe_json_ss has the file list already.

Making this change will require calculating the file list from the data in pdc_describe_json_ss (rather than storing it separately in Solr), while keeping in place the logic in https://github.com/pulibrary/pdc_discovery/blob/main/config/traject/pdc_describe_indexing_config.rb#L244-L261
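
A minimal sketch of what calculating the file list at read time could look like. This is only an illustration, not the code in the repository: the helper name `files_from_solr_document` is hypothetical, and the assumption that the JSON stored in pdc_describe_json_ss has a top-level `"files"` array is mine and should be checked against the real document shape.

```ruby
require "json"

# Hypothetical helper: derive the file list from the JSON stored in
# pdc_describe_json_ss instead of reading a separate files_ss field.
# ASSUMPTION: the stored JSON has a top-level "files" array.
def files_from_solr_document(solr_doc)
  raw = solr_doc["pdc_describe_json_ss"]
  # Solr may return a stored field as a single-element array.
  raw = raw.first if raw.is_a?(Array)
  return [] if raw.nil? || raw.empty?

  files = JSON.parse(raw)["files"]
  files.is_a?(Array) ? files : []
rescue JSON::ParserError
  # Treat malformed JSON as "no files" rather than failing the page.
  []
end

# Usage with a made-up document:
doc = { "pdc_describe_json_ss" => JSON.generate({ "files" => [{ "name" => "a.csv", "size" => 10 }] }) }
files_from_solr_document(doc).each { |file| puts file["name"] }
```

Parsing on read trades a small amount of CPU per request for a smaller Solr document, which is where the cost shows up for the 60K-file datasets.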

It looks like staging sometimes has trouble processing some of our large datasets: sometimes they succeed and sometimes they don't (see details in Honeybadger: https://app.honeybadger.io/projects/95072/faults/116339390). The fix in this issue might help here.


2 participants