
Improve vector search speed by using FixedBitSet #12789

Conversation

benwtrent
Member

While doing some performance testing and digging into flamegraphs, I noticed for smaller vectors (96dim float32), we were losing a fair bit of time within the SparseFixedBitSet#getAndSet method.

I assume we are using SparseFixedBitSet to reduce memory usage?

I ran some tests with topK=100 and fanOut=1000. To check memory usage, in a separate run, I printed out bitSet.ramBytesUsed() after every search.

I tested using FixedBitSet instead with GLOVE and saw almost a 10% improvement in search speed:

completed 1000 searches in 803 ms: 1245 QPS CPU time=801ms
checking results
0.695	 0.80	100000	1000	16	100	1100	0	1.00	post-filter

Vs. baseline

completed 1000 searches in 873 ms: 1145 QPS CPU time=873ms
checking results
0.695	 0.87	100000	1000	16	100	1100	0	1.00	post-filter

The total ramBytesUsed (allocated and then garbage collected) across 1000 searches over GLOVE was 21288656 bytes for the sparse bit set. FixedBitSet allocates only 12544 bytes per search, which works out to 12544000 bytes total (actually less than sparse).
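As a sanity check, the 12544-bytes-per-search figure for the 100k-vector GLOVE index is consistent with ceil(100000/64) = 1563 longs of payload plus a small fixed overhead. A minimal sketch; the 40-byte constant for object and array header overhead is an assumption about typical JVM layout, not a value taken from Lucene:

```java
// Back-of-the-envelope estimate of per-search FixedBitSet allocation.
// The 40-byte header constant is an assumed typical JVM overhead.
public class FixedBitSetFootprint {

  static long estimateBytes(int numBits) {
    long numWords = (numBits + 63) / 64; // one long covers 64 bits
    return numWords * Long.BYTES + 40;   // payload + assumed header overhead
  }

  public static void main(String[] args) {
    System.out.println(estimateBytes(100_000));        // bytes per search
    System.out.println(estimateBytes(100_000) * 1000); // bytes for 1000 searches
  }
}
```

With these assumptions the estimate lands exactly on the reported 12544 bytes per search and 12544000 bytes across 1000 searches.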

To confirm this was still true for larger vectors and a larger graph, I tested against 400k cohere vectors (same params). There is a bit more noise in the measurements, so I averaged the latency over 4 runs:

candidate: 6.115 (min: 5.96)

baseline: 6.23 (min: 6.15)

Total memory usage: 103982464 bytes for sparse vs. 50040000 bytes for fixed.

Do we know the rationale for using a SparseFixedBitSet, and under what conditions it would actually perform better than a regular FixedBitSet? I will happily test some more.

@jpountz
Contributor

jpountz commented Nov 9, 2023

I can believe that FixedBitSet is faster in some cases, but it's surprising to me that the memory usage of SparseFixedBitSet can go up to 2x that of FixedBitSet; this makes me wonder if SparseFixedBitSet#ramBytesUsed is buggy. A SparseFixedBitSet that has all its bits set (the worst-case scenario for memory usage) has an overhead of one long, one object reference, and one array per 4096 bits (= 64 longs), which doesn't feel like it should ever use 2x more heap.
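The worst-case argument can be made concrete with a little arithmetic: per 4096-bit block, the sparse layout stores the same 64 payload longs as FixedBitSet, plus one index long, one array reference, and one array header. A rough sketch, with assumed typical JVM sizes for the reference and header:

```java
// Sanity check of the worst-case SparseFixedBitSet overhead argument:
// with every bit set, each 4096-bit block costs 64 payload longs plus
// one index long, one object reference, and one array header.
// The 8-byte reference and 16-byte array header are assumed typical
// JVM values, not Lucene constants.
public class SparseWorstCase {

  static double worstCaseRatio() {
    double payloadBytes = 64 * 8;  // 64 longs per 4096-bit block
    double overheadBytes = 8       // index long
        + 8                        // object reference to the block
        + 16;                      // array header of the block
    return (payloadBytes + overheadBytes) / payloadBytes;
  }

  public static void main(String[] args) {
    System.out.printf("worst-case sparse/fixed memory ratio: %.3f%n", worstCaseRatio());
  }
}
```

That comes out to roughly 1.06x, nowhere near 2x, which is why the original measurement looked suspicious.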

Separately, I wonder if you know the number of nodes that get visited (ie. the number of bits that end up being set) in your benchmark? Is it a force-merged index or do you have multiple segments?

@benwtrent
Member Author

benwtrent commented Nov 9, 2023

@jpountz I re-ran my tests and double-checked my numbers, and I have a correction: I accidentally double-counted the sparse sizes, so the previous sparse numbers are 2x too big.

GLOVE-100-100_000:
sparse_total_usage: 10644328
fixed_total_usage: 12544000
num_nodes_visited_avg: 3561
num_nodes_visited_min: 1971
num_nodes_visited_max: 6904

Cohere-768-400_000:
sparse_total_usage: 51991232
fixed_total_usage: 50040000
num_nodes_visited_avg: 12324
num_nodes_visited_min: 3213
num_nodes_visited_max: 19979
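For reference, these counts translate into the following visited fractions of each graph; a trivial arithmetic sketch:

```java
// Visited-node fractions implied by the min/max counts above
// (GLOVE graph of 100000 nodes, Cohere graph of 400000 nodes).
public class VisitedFraction {

  static double pct(int visited, int graphSize) {
    return 100.0 * visited / graphSize;
  }

  public static void main(String[] args) {
    System.out.printf("GLOVE:  %.1f%% - %.1f%%%n", pct(1_971, 100_000), pct(6_904, 100_000));
    System.out.printf("Cohere: %.1f%% - %.1f%%%n", pct(3_213, 400_000), pct(19_979, 400_000));
  }
}
```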

EDIT:

@jpountz To confirm the SparseFixedBitSet memory usage, I rewrote ramBytesUsed to compute the exact size (instead of summing up the incremental estimates).

  @Override
  public long ramBytesUsed() {
    // Exact accounting: walk every allocated block instead of relying on
    // the incrementally maintained estimate.
    long bytesUsed = BASE_RAM_BYTES_USED;
    bytesUsed += RamUsageEstimator.sizeOf(indices);      // the index longs
    bytesUsed += RamUsageEstimator.shallowSizeOf(bits);  // the outer array itself
    for (long[] bitArray : bits) {
      if (bitArray != null) {
        bytesUsed += RamUsageEstimator.sizeOf(bitArray); // each allocated block
      }
    }
    return bytesUsed;
  }

Obviously, this ran slightly slower, but from what I found it did not reduce the memory estimate: I still got 51991232.

@jpountz
Contributor

jpountz commented Nov 10, 2023

Thanks, the numbers make more sense to me now.

Intuitively, FixedBitSet performs better when a large percentage of nodes needs to be visited, and SparseFixedBitSet performs better otherwise. Practically, the smaller segments of an index should probably always use a FixedBitSet. E.g. a simple threshold may consist of using SparseFixedBitSet when we would expect it to use less memory than FixedBitSet, i.e. when less than 1/64 ≈ 1.6% of the nodes get visited (or possibly a bit less: if both SparseFixedBitSet and FixedBitSet use similar amounts of memory, it probably makes sense to bias towards FixedBitSet), and FixedBitSet otherwise. I see that your benchmark visits between 2.0% and 6.9% of the nodes on GLOVE and between 0.8% and 5.0% on Cohere, so it makes sense to me that FixedBitSet performs better.
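That selection rule might be sketched as follows; `BitSetChooser` and its `biasFactor` parameter are hypothetical illustrations of the idea, not Lucene code:

```java
// Hypothetical selector: default to FixedBitSet, and fall back to
// SparseFixedBitSet only when the expected visited fraction is below
// the 1/64 memory break-even point. biasFactor < 1 shifts the cutoff
// lower, biasing toward FIXED when memory usage would be similar.
public class BitSetChooser {

  enum Kind { FIXED, SPARSE }

  static Kind choose(int expectedVisited, int graphSize, double biasFactor) {
    double fraction = (double) expectedVisited / graphSize;
    return fraction < biasFactor / 64.0 ? Kind.SPARSE : Kind.FIXED;
  }

  public static void main(String[] args) {
    // GLOVE average from the thread: 3561 of 100000 nodes visited (~3.6%).
    System.out.println(choose(3_561, 100_000, 1.0)); // above 1/64 -> FIXED
    System.out.println(choose(500, 100_000, 1.0));   // below 1/64 -> SPARSE
  }
}
```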

Is it possible to estimate the order of the number of nodes that a nn search needs to visit, so that we could use it as a threshold?

@benwtrent
Member Author

@jpountz Searching scales logarithmically with graph size, but we do have to explore more when a pre-filter excludes nodes.

We can run some experiments to determine the appropriate threshold. I imagine it will be something along the lines of topK * log(graphSize) with some constant scaling applied.
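That heuristic might look something like the following; `estimateVisited` and the scaling constant `c` are placeholders to be fitted from experiments, not anything from Lucene:

```java
// Hypothetical estimator following the topK * log(graphSize) intuition;
// the constant c is a free parameter to be tuned experimentally.
public class VisitEstimator {

  static long estimateVisited(int topK, int graphSize, double c) {
    return Math.round(c * topK * Math.log(graphSize));
  }

  public static void main(String[] args) {
    // GLOVE settings from the thread: topK=100, graph of 100000 nodes.
    System.out.println(estimateVisited(100, 100_000, 3.0));
  }
}
```

With c around 3 this lands in the same ballpark as the observed GLOVE average of 3561 visited nodes, though a real threshold would likely also need to account for fanOut and pre-filtering.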

@jpountz
Contributor

jpountz commented Nov 15, 2023

++ This feels similar to IndexOrDocValuesQuery: we probably can't guess the absolute best threshold, but we can probably figure out something that is right more often than wrong. Hopefully we can keep it simple and not include maxConn and other parameters in the equation.
