-
OK, I found a good library called |
-
Hi! I read the docs https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index and https://github.com/facebookresearch/faiss/wiki/The-index-factory, but I'm new to this and still can't work out which combination of vector transform, IVF/HNSW coarse quantizer, and encoding I should use.
My use case is sequence retrieval for NLP: the database contains ~1G embeddings (original dim = 768), and each query batch contains ~1000 embeddings. I can accept somewhat lower recall (but not too low), because speed is critical and I'll be searching many times (>1k passes per epoch and >1k epochs, so >1M searches in total).
I have 32 GB GPUs and ~100 GB of RAM, and I'd like to avoid multi-GPU indexing at this stage. I'm thinking of using the code here: https://gist.github.com/mdouze/46d6bbbaabca0b9778fca37ed2bcccf6 and changing the index to
OPQ256,IVF262144_HNSW32,PQ16x4fsr
(I made this up more or less at random...) Would it work, or should I change anything?
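For concreteness, here is a minimal sketch (mine, not from this thread) of how that factory string could be instantiated and queried with the FAISS Python API. The random placeholder data, training-sample size, nprobe, and k values are illustrative assumptions, not recommendations:

```python
# Minimal sketch, assuming the FAISS Python API.
# Data sizes, nprobe, and k are placeholders for illustration only.
import numpy as np
import faiss

d = 768  # embedding dimensionality from the question
index = faiss.index_factory(d, "OPQ256,IVF262144_HNSW32,PQ16x4fsr")

# IVF262144 needs at least nlist training points (ideally many more);
# in practice this would be a random subsample of the ~1G database vectors.
xt = np.random.rand(1_000_000, d).astype("float32")  # placeholder training sample
index.train(xt)

# Add database vectors, normally in batches streamed from disk.
xb = np.random.rand(100_000, d).astype("float32")    # placeholder batch
index.add(xb)

# nprobe controls the recall/speed trade-off at search time.
faiss.extract_index_ivf(index).nprobe = 64

# One query batch of ~1000 embeddings, top-10 neighbors.
xq = np.random.rand(1_000, d).astype("float32")
D, I = index.search(xq, 10)
```

Parameters like nprobe (and the HNSW coarse quantizer's efSearch) would need tuning against a recall target; the values above are arbitrary.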