Use a random subset of vectors to generate trees #87

Open
Kerollmops opened this issue Aug 26, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

Kerollmops (Member) commented Aug 26, 2024

The current indexer uses the full set of vectors to generate the trees. The advantage is that the split planes are representative of the vectors in the database. The downside is that it takes a lot of memory, even though the vectors are memory-mapped.

So, to reduce the memory used to index the vectors, we should only use a subset of the vectors to generate the trees, and then add the remaining vectors, those not used to create the planes, into the trees afterwards, probably without refining the planes. Those vectors can be added through incremental insertion.
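Below is a minimal Rust sketch of that partitioning idea, assuming the item ids are known up front; the function name, the `tree_build_budget` parameter, and the rand 0.8-style API are illustrative assumptions, not the indexer's actual code.

```rust
use rand::seq::SliceRandom;

/// Hypothetical sketch: split the item ids into a random subset used to build
/// the trees (i.e. compute the split planes) and a remainder that would be
/// added afterwards through incremental insertion, without refining the planes.
fn split_items_for_indexing(
    all_item_ids: &[u32],
    tree_build_budget: usize,
) -> (Vec<u32>, Vec<u32>) {
    let mut rng = rand::thread_rng();

    // Shuffle a copy of the ids so the subset is a uniform random sample.
    let mut shuffled = all_item_ids.to_vec();
    shuffled.shuffle(&mut rng);

    let cut = tree_build_budget.min(shuffled.len());
    let (subset, remainder) = shuffled.split_at(cut);

    // `subset` feeds the tree/plane generation; `remainder` only goes through
    // the incremental-insertion path once the trees exist.
    (subset.to_vec(), remainder.to_vec())
}
```

Under this scheme the peak memory needed for plane generation is bounded by the budget, while the rest of the vectors never have to be resident at the same time.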

@Kerollmops Kerollmops added the enhancement New feature or request label Aug 26, 2024
irevoire (Member) commented:

Here's the list of steps we're going to take to solve this issue:

  1. Reproduce the issue
  2. Try multiple strategies to mitigate the issue by reducing the set of items used in two_means, from the easiest to the most complex:
     - Naively select a random set of candidates => we might not end up with as many vectors as we want; we should log that
     - Load a contiguous set of candidates (e.g. all ids between rand and rand+nb_vectors_that_fits_in_RAM) => might decrease the relevancy
     - Load a random set of vectors and, for each of them, also load all the contiguous vectors on the same page => this should increase the number of vectors we can load without increasing the RAM or read consumption (a rough sketch of this follows the list)
  3. If these mitigations are not enough, we need to meet up again; another solution may be to do multiple incremental updates instead of one large update
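Here is a rough sketch of the page-grouped strategy mentioned above, combined with the logging caveat from the first bullet. The function name, the `vectors_per_page` parameter, and the rand 0.8-style calls are assumptions for illustration, not the project's actual API.

```rust
use rand::Rng;

/// Hypothetical sketch: pick random "seed" ids and extend every pick to all the
/// ids stored on the same page, so the extra vectors come "for free" with a page
/// that had to be read anyway. Assumes fixed-size vectors laid out contiguously,
/// `vectors_per_page` of them per page.
fn page_grouped_candidates(
    total_items: u32,
    vectors_per_page: u32,
    target_count: usize,
) -> Vec<u32> {
    let vectors_per_page = vectors_per_page.max(1);
    let mut rng = rand::thread_rng();

    let wanted = target_count.min(total_items as usize);
    let mut candidates: Vec<u32> = Vec::with_capacity(wanted);

    while candidates.len() < wanted {
        // Pick a random seed item, then take every item on the same page.
        let seed = rng.gen_range(0..total_items);
        let page_start = (seed / vectors_per_page) * vectors_per_page;
        let page_end = (page_start + vectors_per_page).min(total_items);
        candidates.extend(page_start..page_end);
        candidates.sort_unstable();
        candidates.dedup();
    }
    candidates.truncate(wanted);

    // Caveat from the first bullet: if fewer candidates exist than requested,
    // log it instead of failing silently.
    if wanted < target_count {
        eprintln!("only {wanted} candidates available out of the {target_count} requested");
    }

    candidates
}
```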
