Indexing optimization #741

hectorcorrea · 2025-01-14T14:52:19Z

We are currently indexing the file list for each record twice: once as part of pdc_describe_json_ss and again as files_ss. This is not a problem for small datasets, but for datasets with 60K files it is rather inefficient.

There is really no need to index the second field (files_ss) since pdc_describe_json_ss has the file list already.

Making this change will require calculating the file list from the data in pdc_describe_json_ss (rather than storing it separately in Solr), while keeping in place the logic at https://github.com/pulibrary/pdc_discovery/blob/main/config/traject/pdc_describe_indexing_config.rb#L244-L261.
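A minimal sketch of what calculating the file list at read time could look like. This is not the code in pdc_discovery: the helper name `files_from_solr_doc`, the `"files"` key, and the shape of each file entry are assumptions made for illustration. Only the field name pdc_describe_json_ss comes from this issue.

```ruby
require "json"

# Hypothetical helper: derives the file list from the JSON stored in
# pdc_describe_json_ss instead of reading a separate files_ss field.
# Assumption: the JSON is an object with a top-level "files" array.
def files_from_solr_doc(solr_doc)
  raw = solr_doc["pdc_describe_json_ss"]
  # Solr may return the value as a string or as a one-element array.
  raw = raw.first if raw.is_a?(Array)
  return [] if raw.nil? || raw.empty?

  parsed = JSON.parse(raw)
  files = parsed.is_a?(Hash) ? parsed["files"] : nil
  files.is_a?(Array) ? files : []
rescue JSON::ParserError
  # Treat malformed JSON as a record with no files.
  []
end
```

With something along these lines the file list is parsed once per request from a field that is already indexed, so files_ss would no longer need to be written for each record.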

It looks like staging sometimes has issues processing some of our large datasets: sometimes they succeed and sometimes they don't (see details in Honeybadger: https://app.honeybadger.io/projects/95072/faults/116339390). The fix in this issue might help here.


hectorcorrea mentioned this issue Jan 14, 2025

Indexing strategy for large datasets #738

Open

hectorcorrea assigned claudiawulee Feb 5, 2025

hectorcorrea mentioned this issue Feb 14, 2025

Stop indexing the file list twice #752

Merged

bess closed this as completed in #752 Feb 19, 2025
