Hard to achieve high recall rate(recall@R) for huge workload #4190

EvagelineFEI · 2025-02-10T09:45:03Z


Hello team,
I am using faiss.IndexHNSWFlat with the sift1b dataset from http://corpus-texmex.irisa.fr/.
I add 10M vectors to the index and then tune the parameters (M, efConstruction, efSearch) to improve recall@100.
I have tried many combinations (M from 5 to 500), but my best result is stuck at 0.93.
My recall calculation code does not appear to contain an error, and the results are better on smaller subsets (such as 1M or 2M, where recall reaches 0.96 or higher). What could the problem be?
Some of my machine info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
CPU family: 6
Model: 79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
Stepping: 1
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.00
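
For reference, the efSearch part of the tuning described above can be swept on a single built index, since efSearch is a search-time setting. The following is a minimal sketch, not the code used in this post: `sweep_ef_search` is a hypothetical helper, the `ef_values` defaults are arbitrary, and `true_first` is assumed to hold the true nearest-neighbor id of each query computed against the same 10M subset that was indexed.

```python
def sweep_ef_search(index, queries, true_first, r=100, ef_values=(128, 256, 512, 1024)):
    """Return {efSearch: fraction of queries whose true nearest neighbor is in the top r}.

    index      : an already built HNSW index exposing index.hnsw.efSearch and index.search
    queries    : query vectors in whatever form index.search accepts
    true_first : true nearest-neighbor id per query (assumed to match the indexed subset)
    """
    results = {}
    for ef in ef_values:
        # efSearch only affects the search, so the index is not rebuilt between runs.
        index.hnsw.efSearch = ef
        _, ids = index.search(queries, r)
        hits = 0
        for q in range(len(ids)):
            top_r = {int(i) for i in ids[q][:r]}
            if int(true_first[q]) in top_r:
                hits += 1
        results[ef] = hits / len(ids)
    return results
```

With Faiss, this would be called on an index created as `faiss.IndexHNSWFlat(d, M)`, with `index.hnsw.efConstruction` set before `index.add(...)`. If recall keeps rising as efSearch grows, the search budget is probably the limit; if it plateaus, the build parameters or the ground truth are more likely suspects.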

satymish · 2025-02-14T11:11:25Z

Collaborator

Hi @EvagelineFEI, what is the data distribution like for the smaller datasets and for the larger ones? Is the data randomly sampled, and is its distribution uniform?

1 reply

EvagelineFEI Feb 15, 2025
Author

Thank you @satymish. The smaller datasets are the first n vectors of the bigann_base.bvecs file (n = 1M, 2M, 5M, 10M, 20M, 50M, 100M, 200M, 500M, 1B), as described in the official readme at http://corpus-texmex.irisa.fr.
As for the data distribution of the larger datasets, are you asking whether the dataset has characteristics such as clustering, dimension correlation, sparsity, or a particular intrinsic dimensionality?
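
For anyone reproducing these subsets, here is a minimal sketch of reading the first n vectors of a .bvecs file. It assumes the layout documented on the corpus-texmex site, where each record is a 4-byte little-endian integer giving the dimension, followed by that many unsigned 8-bit components. `read_bvecs_prefix` is a hypothetical helper name, not a Faiss function, and the default `dim=128` is the SIFT descriptor size.

```python
import numpy as np

def read_bvecs_prefix(path, n, dim=128):
    """Return the first n vectors of a .bvecs file as a float32 array of shape (n, dim)."""
    record = 4 + dim  # 4-byte dimension header + dim uint8 components
    raw = np.memmap(path, dtype=np.uint8, mode="r")
    if raw.size % record != 0:
        raise ValueError("file size is not a multiple of the record size")
    if n > raw.size // record:
        raise ValueError("file holds fewer than n vectors")
    block = raw[: n * record].reshape(n, record)
    # Every record header should store the expected dimension.
    header = np.ascontiguousarray(block[:, :4]).view("<i4")[:, 0]
    if not (header == dim).all():
        raise ValueError("unexpected dimension in a record header")
    # Drop the headers and convert to float32, the dtype Faiss indexes expect.
    return np.ascontiguousarray(block[:, 4:], dtype=np.float32)
```

Because the file is memory-mapped, only the requested prefix is read from disk, so taking the first 10M vectors does not require loading the full billion-vector file.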

mnorris11 · 2025-02-14T18:41:13Z

Collaborator

@EvagelineFEI how many query vectors are you using? All 10k from sift1b? And can you paste how you are computing recall@100?

1 reply

EvagelineFEI Feb 15, 2025
Author

Thank you @mnorris11. For the larger one I am using 10M vectors (10 × 1,000,000), a subset of the full ann_sift1b dataset.

    # Search the top R neighbors for every query (self.R = 100)
    D, I = index.search(self.query, k=self.R)
    recall = self.compute_recall_rate(I)

    def compute_recall_rate(self, I):
        # Counts the queries whose true nearest neighbor appears in the top R results
        correct = 0
        total = len(self.true_neighbors)  # The ground truth data provided by the dataset
        for query_idx in range(I.shape[0]):
            predicted_neighbors = set(I[query_idx][:self.R])  # self.R = 100
            if self.true_neighbors[query_idx][0] in predicted_neighbors:
                correct += 1
        recall_at_r = correct / total
        return recall_at_r
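
The function above checks whether the single true nearest neighbor appears among the top R results, which is one of two common readings of "recall@R". The sketch below puts that reading next to the intersection-based one so the two numbers can be compared. Both helper names are hypothetical, and both assume the ground truth rows were computed against the same subset that was indexed (here, the first 10M vectors), since ground truth for the full 1B set would point at ids that are not in the index.

```python
def recall_1_at_r(found_ids, true_neighbors, r=100):
    """Fraction of queries whose true nearest neighbor is among the top r results."""
    n_queries = len(found_ids)
    hits = 0
    for q in range(n_queries):
        top_r = {int(i) for i in found_ids[q][:r]}
        if int(true_neighbors[q][0]) in top_r:
            hits += 1
    return hits / n_queries

def recall_r_at_r(found_ids, true_neighbors, r=100):
    """Average overlap between the top r results and the r true nearest neighbors."""
    n_queries = len(found_ids)
    overlap = 0
    for q in range(n_queries):
        # Ignore -1 entries, which mark result slots the search could not fill.
        top_r = {int(i) for i in found_ids[q][:r] if int(i) >= 0}
        truth = {int(i) for i in true_neighbors[q][:r]}
        overlap += len(top_r & truth)
    return overlap / (n_queries * r)
```

Both functions divide by the number of queries actually searched, so a ground truth array that is longer than the query set does not lower the score.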
