Use a random subset of vectors to generate trees #87

Open
Kerollmops opened this issue Aug 26, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

Kerollmops (Member) commented Aug 26, 2024

The current indexer uses the full set of vectors to generate the trees. The advantage is that the split planes are representative of the vectors in the database. The downside is that it takes a lot of memory, even though the vectors are memory-mapped.

So, to reduce the memory used to index the vectors, we should only use a subset of the vectors to generate the trees, and then add the remaining vectors, those not used to create the planes, into the trees afterwards, probably without refining the planes. Those vectors can be added through incremental insertion.
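Below is a minimal Rust sketch of that partitioning idea, assuming the item ids are known up front; the function name, the `tree_build_budget` parameter, and the rand 0.8-style API are illustrative assumptions, not the indexer's actual code.

```rust
use rand::seq::SliceRandom;

/// Hypothetical sketch: split the item ids into a random subset used to build
/// the trees (i.e. compute the split planes) and a remainder that would be
/// added afterwards through incremental insertion, without refining the planes.
fn split_items_for_indexing(
    all_item_ids: &[u32],
    tree_build_budget: usize,
) -> (Vec<u32>, Vec<u32>) {
    let mut rng = rand::thread_rng();

    // Shuffle a copy of the ids so the subset is a uniform random sample.
    let mut shuffled = all_item_ids.to_vec();
    shuffled.shuffle(&mut rng);

    let cut = tree_build_budget.min(shuffled.len());
    let (subset, remainder) = shuffled.split_at(cut);

    // `subset` feeds the tree/plane generation; `remainder` only goes through
    // the incremental-insertion path once the trees exist.
    (subset.to_vec(), remainder.to_vec())
}
```

Under this scheme the peak memory needed for plane generation is bounded by the budget, while the rest of the vectors never have to be resident at the same time.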

@Kerollmops Kerollmops added the enhancement New feature or request label Aug 26, 2024
irevoire (Member) commented:

Here's the list of steps we're going to take to solve this issue:

  1. Reproduce the issue
  2. Try multiple strategies to mitigate the issue by reducing the set of items used in two_means, from the easiest to the most complex:
     - Naively select a random set of candidates => we might not end up with as many vectors as we want; we should log that
     - Load a contiguous set of candidates (e.g. all ids between rand and rand+nb_vectors_that_fits_in_RAM) => might decrease the relevancy
     - Load a random set of vectors and, for each of them, also load all the contiguous vectors on the same page => this should increase the number of vectors we can load without increasing the RAM or read consumption (a rough sketch of this follows the list)
  3. If these mitigations are not enough, we need to meet up again; another solution may be to do multiple incremental updates instead of one large update
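Here is a rough sketch of the page-grouped strategy mentioned above, combined with the logging caveat from the first bullet. The function name, the `vectors_per_page` parameter, and the rand 0.8-style calls are assumptions for illustration, not the project's actual API.

```rust
use rand::Rng;

/// Hypothetical sketch: pick random "seed" ids and extend every pick to all the
/// ids stored on the same page, so the extra vectors come "for free" with a page
/// that had to be read anyway. Assumes fixed-size vectors laid out contiguously,
/// `vectors_per_page` of them per page.
fn page_grouped_candidates(
    total_items: u32,
    vectors_per_page: u32,
    target_count: usize,
) -> Vec<u32> {
    let vectors_per_page = vectors_per_page.max(1);
    let mut rng = rand::thread_rng();

    let wanted = target_count.min(total_items as usize);
    let mut candidates: Vec<u32> = Vec::with_capacity(wanted);

    while candidates.len() < wanted {
        // Pick a random seed item, then take every item on the same page.
        let seed = rng.gen_range(0..total_items);
        let page_start = (seed / vectors_per_page) * vectors_per_page;
        let page_end = (page_start + vectors_per_page).min(total_items);
        candidates.extend(page_start..page_end);
        candidates.sort_unstable();
        candidates.dedup();
    }
    candidates.truncate(wanted);

    // Caveat from the first bullet: if fewer candidates exist than requested,
    // log it instead of failing silently.
    if wanted < target_count {
        eprintln!("only {wanted} candidates available out of the {target_count} requested");
    }

    candidates
}
```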
