Optimize SPSC queue #121
Conversation
I am unsure the commit *Avoid float array pessimization* and a bit of the previous one (removing the indirection of the array and thus doing initialization with `Obj.magic`) are a good idea. They introduce a lot of `Obj.magic`, which I feel we should be careful with, especially because it could cause issues with Flambda.
Also, what happens if you actually put floats in the queue? I am guessing the `float array` special case will not be triggered?
I have no immediate solution for the float array pessimization, but for initialization, we could have a `make : size_exponent:int -> 'a -> 'a t` function instead of the `create` function, requiring the user to provide an inhabitant of type `'a`.
It may just move the issue to the user level, though: users may then want to either use the same `Obj.magic` trick or fall back to an `option` type...
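For illustration, here is a hedged sketch of that alternative API (the names and record fields are hypothetical, not the actual queue): the caller supplies an inhabitant of `'a`, so the backing array can be initialized without `Obj.magic`.

```ocaml
(* Hypothetical sketch of the suggested make-with-inhabitant API.
   The dummy value fills the unused slots of the backing array. *)
type 'a t = { array : 'a array; mutable head : int; mutable tail : int }

let make ~size_exponent (dummy : 'a) : 'a t =
  let size = 1 lsl size_exponent in
  { array = Array.make size dummy; head = 0; tail = 0 }
```

The cost, as noted above, is that the caller must be able to produce some value of type `'a` up front.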
Thanks for the PR! All the optimizations (except maybe the ones mentioned in my previous message :p) are great, and thanks again for making the git history so readable. And obviously, the gain in performance is amazing!
I should have added that the optimization with the caches is very cool!!
This should stabilize performance results to some degree, but, as it stands, there is still a massive amount of contention, so results will likely still have a lot of randomness. This is also one of two crucial optimizations: without it, best performance simply cannot be obtained.
Transparent atomic provides a workaround for the OCaml 5.(0|1) CSE bugs around `Atomic.get`. In this case, an invalid CSE optimization on `Atomic.get`s can actually break the queue semantics. Fences are not really needed in this specific queue, except for the updates of the head and tail indices, and even there the fences are only needed to ensure that the other side sees the index update only after the updates to the array. `Atomic.incr` is potentially a little faster, because it needs no write barrier.
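A minimal sketch of the workaround idea (this is not the actual transparent atomic implementation): routing the atomic through `Sys.opaque_identity` hides it from the optimizer, so two `Atomic.get` calls on the same atomic cannot be invalidly merged by CSE.

```ocaml
(* Sketch only: opaque_identity forces the compiler to treat each
   read as a fresh load instead of reusing an earlier one. *)
let get (a : 'a Atomic.t) : 'a = Atomic.get (Sys.opaque_identity a)
```

A read that must observe a concurrent write then goes through `get` instead of a bare `Atomic.get`.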
Wide pattern matches, even with immutable fields, tend to generate inefficient code in OCaml at the moment. Specifically, the compiler tends to generate a sequence of loads for the matched fields at the start of the function. The problem with this is that it increases register pressure, which may lead to spilling (i.e. extra reads and writes to stack), and may also lead to fetching some fields unnecessarily (i.e. wasted work) as those fields are only accessed conditionally. It is typically better to access immutable fields using dot notation. The compiler performs CSE on such loads so repeated accesses of immutable fields do not necessarily generate repeated loads of those fields.
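An illustrative comparison (both functions compute the same value): the wide match encourages the compiler to load all matched fields at function entry, while dot notation loads each field at its point of use and lets CSE merge repeated loads of the immutable `array` and `mask` fields.

```ocaml
type 'a t = { array : 'a array; mask : int; mutable head : int }

(* Wide match: array, mask and head tend to be fetched eagerly. *)
let get_wide { array; mask; head; _ } = array.(head land mask)

(* Dot notation: fields are loaded where they are actually used. *)
let get_dot t = t.array.(t.head land t.mask)
```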
This reduces contention as operations can be performed on either side in parallel without interference from the other side. Without this algorithmic change, most of the other optimizations become irrelevant, because the contention on index accesses would simply limit the peak performance by an order of magnitude or more. Try removing this change and observe the massive drop in peak performance.
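A hedged sketch of the cached-index idea (simplified, not the PR's exact code): the producer re-reads the consumer-owned `head` only when its private snapshot suggests the queue is full, and symmetrically for the consumer, so in the common case neither side reads a cache line the other side writes. Real implementations also pad these fields apart to avoid false sharing.

```ocaml
type 'a t = {
  array : 'a option array;
  tail : int Atomic.t;        (* written only by the producer *)
  mutable head_cache : int;   (* producer's stale snapshot of head *)
  head : int Atomic.t;        (* written only by the consumer *)
  mutable tail_cache : int;   (* consumer's stale snapshot of tail *)
}

let create size =
  { array = Array.make size None;
    tail = Atomic.make 0; head_cache = 0;
    head = Atomic.make 0; tail_cache = 0 }

let try_push t x =
  let tail = Atomic.get t.tail in
  let size = Array.length t.array in
  (* Refresh the snapshot only when the queue looks full. *)
  if tail - t.head_cache >= size then t.head_cache <- Atomic.get t.head;
  if tail - t.head_cache >= size then false
  else begin
    t.array.(tail mod size) <- Some x;
    Atomic.set t.tail (tail + 1);
    true
  end

let try_pop t =
  let head = Atomic.get t.head in
  (* Refresh the snapshot only when the queue looks empty. *)
  if head = t.tail_cache then t.tail_cache <- Atomic.get t.tail;
  if head = t.tail_cache then None
  else begin
    let i = head mod Array.length t.array in
    let x = t.array.(i) in
    t.array.(i) <- None;
    Atomic.set t.head (head + 1);
    x
  end
```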
The `mask` can be computed very quickly from the length of the array which is known to be a power of two. Avoiding the indirection means that an allocation, some pointer chasing, and potentially a couple of cache misses (an extra object needs to be written by the producer (a potential cache miss) and read by consumer (a definite cache miss, because the memory is written by the producer)) are avoided per message. This provides a significant performance improvement now that contention has been reduced. Using `Obj.magic ()` with an array that may contain `float` values is prone to be problematic and it makes sense to test that things work.
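Concretely, when the capacity is a power of two, `mod` on indices reduces to a single AND with a mask derived from the array length:

```ocaml
(* Power-of-two capacity: mask = length - 1, so i mod length = i land mask. *)
let mask array = Array.length array - 1
let slot array i = i land mask array
```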
This tweaks the code that checks whether the queue is full or empty, for a minor improvement.
Also, report an out-of-range `size_exponent` as an invalid argument.
OCaml currently has a special case for `float array`. This means that whenever an array that might be a `float array` is accessed, the compiler generates an extra conditional to treat the array properly. These conditionals slow down array accesses slightly. Here we hack around that by specifying that the array's elements are of a "pointer" type. This way the compiler will generate array accesses with potential write barriers and avoid generating code for the `float array` special case. This improves performance.
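A hedged sketch of the hack (simplified and unsafe in general): a phantom `'a` over an `Obj.t array` tells the compiler the array can never be a flat float array, so reads and writes skip the float-array conditional. Floats stored this way end up boxed, which is exactly why the float case deserves dedicated tests.

```ocaml
(* Phantom 'a: the elements are really 'a, but the compiler sees Obj.t. *)
type 'a t = Obj.t array

let create size : 'a t = Array.make size (Obj.repr ())
let set (t : 'a t) i (x : 'a) = t.(i) <- Obj.repr x
let get (t : 'a t) i : 'a = Obj.obj t.(i)
```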
This reduces code size at the expense of one or two conditionals. The time taken by those conditionals is fairly insignificant. The reduction in code size will likely improve performance in real applications, because pressure on the instruction cache is reduced. This approach is likely to be useful in various other data structure implementations and is likely often preferable to inlining to avoid code duplication.
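An illustrative sketch of the trade-off (names hypothetical): the rarely taken path is moved into a non-inlined helper, so the hot path compiles to a small body at the cost of one call on the cold branch.

```ocaml
type t = { mutable count : int; limit : int }

(* Cold path: shared by every call site instead of being inlined. *)
let[@inline never] add_slow t =
  t.count <- 0;
  false

(* Hot path: small body that stays friendly to the instruction cache. *)
let add t =
  if t.count < t.limit then (t.count <- t.count + 1; true)
  else add_slow t
```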
This prevents Flambda from noticing that we are potentially storing float values into a non-float array.
The changes proposed in this PR have already been merged in two different PRs (#133 and #135) that separate "safe" and "unsafe" optimizations. Thanks @polytypic for all this work!
This PR optimizes the SPSC queue.
See the individual commits to better understand the motivation behind the optimizations.
Here is a run of the benchmarks on my M3 Max before the optimizations:
Here is a run of the benchmarks after the optimizations:
TODO: