
[Cache] Consider increasing concurrency level from 1 to a potentially higher/default value #2557

Open
sgup432 opened this issue Feb 25, 2025 · 7 comments
Labels
Enhancements Increases software capabilities beyond original client specifications

Comments

@sgup432

sgup432 commented Feb 25, 2025

I noticed that QuantizationCache, which uses a Guava cache as its underlying implementation, sets the concurrency level to 1 here.

The same goes for NativeMemoryCacheManager and ModelCache.

I wanted to know whether this is intentional.

The concurrency level internally controls the number of partitions within the cache, where each partition holds its own read/write lock. With a concurrency level of 1, essentially only one thread can write to the cache at a time, so with multiple writers you will see degraded read/write performance.

It would be better to increase it (4, 16, etc.) or use the default, which is 4. A quick micro-benchmark would confirm this.
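
For reference, the setting in question is Guava's CacheBuilder.concurrencyLevel. A quick sketch of what bumping it looks like (the sizes, key/value types, and TTL below are made up for illustration, not the actual k-NN cache configuration):

```java
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class ConcurrencyLevelSketch {
    public static void main(String[] args) {
        Cache<String, byte[]> cache = CacheBuilder.newBuilder()
                .concurrencyLevel(4)                     // Guava's default; the k-NN caches currently pass 1
                .maximumWeight(512L * 1024 * 1024)       // weight-based LRU eviction (per segment)
                .weigher((String key, byte[] value) -> value.length)
                .expireAfterAccess(60, TimeUnit.MINUTES) // TTL on access
                .build();

        cache.put("graph-file-1", new byte[1024]);
        System.out.println(cache.getIfPresent("graph-file-1").length);
    }
}
```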

If you're looking for even better performance, give Caffeine a try—it's faster 😁

@sgup432 sgup432 changed the title [QuantizationCache] Consider increasing concurrency level from 1 to a potentially higher/default value [Cache] Consider increasing concurrency level from 1 to a potentially higher/default value Feb 25, 2025
@navneet1v navneet1v added Enhancements Increases software capabilities beyond original client specifications and removed untriaged labels Feb 26, 2025
@jmazanec15
Member

Thanks @sgup432. It's probably worth checking out. Traditionally, the cache does not require much write throughput compared to read throughput, which is why the concurrency level is set to 1. However, I don't see how increasing it could hurt. What do you think, @kotwanikunal?

@ben-manes

FYI - It actually impacts read throughput too.

Guava splits the maximum weight between the partitions, so that might have been a reason if there was concern about large entries. Or maybe it was for testing, to get strict LRU behavior.

Caffeine doesn't partition the eviction policy and doesn't need a concurrency level. It does default to async eviction (user callbacks have unknown cost, and this is fine to disable) and adds randomness to eviction to prevent deterministic attacks. So testing should focus on behavior instead of state, if that was an original concern.
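
For illustration, a sketch of disabling the async maintenance so eviction and listeners run on the calling thread, which keeps tests deterministic (the cache shape below is hypothetical, not the k-NN configuration):

```java
import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class DeterministicEvictionSketch {
    public static void main(String[] args) {
        // There is no concurrencyLevel knob in Caffeine; this builder is a
        // hypothetical stand-in for the k-NN cache, not its real configuration.
        Cache<String, byte[]> cache = Caffeine.newBuilder()
                .maximumWeight(512L * 1024 * 1024)
                .weigher((String key, byte[] value) -> value.length)
                .expireAfterAccess(60, TimeUnit.MINUTES)
                // Run maintenance (eviction, listeners) on the calling thread so
                // tests can assert on cache state immediately after a write.
                .executor(Runnable::run)
                .build();

        cache.put("graph-file-1", new byte[1024]);
        cache.cleanUp(); // force any pending maintenance to run now
        System.out.println(cache.estimatedSize());
    }
}
```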

@kotwanikunal
Member

I think you bring up a good point, @sgup432. And thanks for your input, @ben-manes.

I can comment mainly on NativeMemoryCacheManager:

We use the cache as a representative structure for the off-heap memory objects.
In high-throughput cases we are memory bound: you can only load so many graph files into off-heap memory. (I would say on the order of low hundreds would be a fair estimate.)

The reason we use the cache is to take advantage of the baked-in TTL + LRU mechanisms. We could potentially achieve the latter with just a priority queue and a custom Comparator (roughly sketched below).
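
For what it's worth, a rough sketch of that PQ idea (illustration only, not actual project code; it covers LRU ordering but not TTL, and it is not thread-safe):

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical LRU tracker for graph-file entries ordered by last access time.
// Access updates are O(n) in the queue, which is tolerable for a few hundred entries.
public class LruQueueSketch {
    static final class Entry {
        final String file;
        long lastAccessNanos;

        Entry(String file, long lastAccessNanos) {
            this.file = file;
            this.lastAccessNanos = lastAccessNanos;
        }
    }

    private final Map<String, Entry> index = new HashMap<>();
    private final PriorityQueue<Entry> byAccessTime =
            new PriorityQueue<>(Comparator.comparingLong((Entry e) -> e.lastAccessNanos));

    // Record an access; removing and re-adding keeps the heap ordering correct.
    void touch(String file) {
        Entry entry = index.computeIfAbsent(file, f -> new Entry(f, 0L));
        byAccessTime.remove(entry);
        entry.lastAccessNanos = System.nanoTime();
        byAccessTime.add(entry);
    }

    // Evict the least recently used entry, e.g. when off-heap memory is full.
    String evictLeastRecentlyUsed() {
        Entry victim = byAccessTime.poll();
        if (victim == null) {
            return null;
        }
        index.remove(victim.file);
        return victim.file;
    }
}
```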

In fact, with #2015 we recently added a force-eviction mechanism to keep the memory structures in sync with off-heap memory and avoid OOM issues on nodes. The solution limits cache entry creation to one thread at a time in the cache-miss + overflow scenario.
This was mainly done to work around the asynchronous cleanup behavior of Guava, which frequently led to node drops because the cleanup happened after we added a new entry to the cache and initialized it in off-heap memory, leading to OOM. This solution is just a stopgap and chooses concurrency limitations over node drops (latency hits over instability).

@ben-manes @sgup432 I was reading through the Caffeine docs and it seems pretty good for high-throughput use cases. Do you have any suggestions on how we could handle the use case described above? I was inclined toward a custom PQ/tracking mechanism, since a cache seems like overkill for our use case.

Also, we cannot segment our off-heap memory to mirror a segmented cache; in certain scenarios a single graph file can consume the entire off-heap memory.

@sgup432
Author

sgup432 commented Feb 26, 2025

The solution limits cache entry creation to one thread at a time in the cache-miss + overflow scenario.
This was mainly done to work around the asynchronous cleanup behavior of Guava, which frequently led to node drops because the cleanup happened after we added a new entry to the cache and initialized it in off-heap memory

I am not sure how keeping the concurrency level at 1 actually solves this. First, I always thought Guava cleanup happens on the same thread that is writing to the cache, after checking the size/time criteria, and not asynchronously (i.e., on a different thread)? Even if it doesn't, the concurrency level and the cache performing cleanup asynchronously are not related in any way, IMO. Maybe @ben-manes can comment on that.

Additionally, the force-eviction logic should ideally be agnostic of the underlying concurrency level and work with multiple partitions as well?

Also, we cannot segment our off-heap memory to mirror a segmented cache; in certain scenarios a single graph file can consume the entire off-heap memory.

Not sure I get this entirely. Considering you are using an on-heap cache to track off-heap memory, with multiple segments you would give each segment a capacity of N/num_of_segments, where N is the total size of the cache, and each segment would independently track its own size. Wouldn't this similarly avoid overconsumption of off-heap memory?

@ben-manes

I'd have recommended OHC but it is no longer maintained. It is an off-heap native cache previously used by Cassandra, which now uses a custom allocator. I don't know enough, but I think they use Caffeine to wrap off-heap buffers, likely similar to your cache (ChunkCache or SlabAllocator).

force-eviction mechanism to keep the memory structures in sync with off-heap memory... This was mainly done to work around the asynchronous cleanup behavior of Guava

I think you are referring to your removalListener being called outside of the eviction lock to perform the cleanup, whereas @sgup432 is correct that the LRU eviction itself happens on the caller's thread. Guava forked Java 5's segmented hash table, predating computes, so running foreign user code under the segment lock could be too coarse. Instead, the removal notification is added to a ConcurrentLinkedQueue and any caller can drain it and invoke the listener outside of the hash table locks. That makes it non-atomic, which is typically fine except when synchronizing data structures.

Caffeine is layered above a Java 8 ConcurrentHashMap which was rewritten to use hash bin locks, where a bin is typically 1-3 entries, so larger caches have higher write concurrency. This allowed them to add compute methods that were linearizable. Caffeine's evictionListener is run within the eviction's computeIfPresent to remove the entry, which allows you to keep data structures in sync by having exclusive computations to run your logic in (docs).
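
A rough sketch of that pattern, with a hypothetical releaseNativeMemory hook and byte counter standing in for the k-NN off-heap bookkeeping (not actual project code):

```java
import java.util.concurrent.atomic.AtomicLong;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;

public class EvictionListenerSketch {
    // Hypothetical tracker for bytes currently held off heap.
    private static final AtomicLong OFF_HEAP_BYTES = new AtomicLong();

    public static void main(String[] args) {
        Cache<String, Long> graphCache = Caffeine.newBuilder()
                .maximumWeight(512L * 1024 * 1024)
                .weigher((String file, Long bytes) -> (int) Math.min(bytes, Integer.MAX_VALUE))
                // Runs inside the entry's atomic removal, so the off-heap
                // bookkeeping is updated before the eviction completes.
                .evictionListener((String file, Long bytes, RemovalCause cause) -> {
                    releaseNativeMemory(file);       // hypothetical native cleanup hook
                    OFF_HEAP_BYTES.addAndGet(-bytes);
                })
                .build();

        graphCache.put("graph-file-1", 1_000_000L);
        OFF_HEAP_BYTES.addAndGet(1_000_000L);
    }

    private static void releaseNativeMemory(String file) {
        // Placeholder for freeing the off-heap graph tied to this entry.
    }
}
```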

Also, we cannot segment our off-heap memory to mirror a segmented cache; in certain scenarios a single graph file can consume the entire off-heap memory.

@sgup432 I think you missed the "cannot segment" part, as in they want one entry to be able to use all of the cache's capacity. Guava's N/S per-segment behavior hurts here, but that is not the case in Caffeine.

@sgup432
Author

sgup432 commented Feb 26, 2025

@sgup432 I think you missed the "cannot segment" part, as in they want one entry to be able to use all of the cache's capacity. Guava's N/S per-segment behavior hurts here, but that is not the case in Caffeine.

Ah yeah. In that case segmenting would hurt them.

@kotwanikunal
Member

I am not sure how keeping the concurrency level at 1 actually solves this. First, I always thought Guava cleanup happens on the same thread that is writing to the cache, after checking the size/time criteria, and not asynchronously (i.e., on a different thread)? Even if it doesn't, the concurrency level and the cache performing cleanup asynchronously are not related in any way, IMO. Maybe @ben-manes can comment on that.

I think what I was referring to is that we have a bigger problem than the concurrency level when it comes to cache access behavior. Until we find a better approach than force eviction, no amount of concurrency will help while that forced bottleneck is in place.

I think you are referring to your removalListener being called outside of the eviction lock to perform the cleanup, whereas @sgup432 is correct that the LRU eviction itself happens on the caller's thread.

Yes. I was referring to the removal listener and not the eviction itself.

Caffeine is layered above a Java 8 ConcurrentHashMap which was rewritten to use hash bin locks, where a bin is typically 1-3 entries, so larger caches have higher write concurrency. This allowed them to add compute methods that were linearizable. Caffeine's evictionListener is run within the eviction's computeIfPresent to remove the entry, which allows you to keep data structures in sync by having exclusive computations to run your logic in (docs).

This is nice. We could potentially try exploring this.
