Faster fast-merge #1
Nice! This is really interesting!
Feel free to add away, just reference me in a comment if you do, please. Maybe it helps others find a common repository of optimized functions. You should benchmark as well, to make sure my results are consistent as claimed. I'm compiling some stuff specifically for low-level perf (really just regaining performance that's left on the floor in clojure.core, primarily due to overly polymorphic code and naive implementations) while optimizing a couple of different projects (and incorporating old lessons), like the various performance optimizations against Rust here and little libs like structural. Another area that sucks, fyi, is the default clojure.core/memoize. It can be sped up significantly with a concurrent mutable hashmap, and if you have a discrete number of args, you can get it way faster with a specialized function that uses multiple hashmaps instead of the stock `& args` variadic version.
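To make the memoize point concrete, here is a minimal sketch of the ConcurrentHashMap approach (illustrative only; `cmemoize` and `cmemoize1` are made-up names, not clj-fast's or clojure.core's API):

```clojure
;; A minimal sketch of memoization over a ConcurrentHashMap.
;; Caveats: does not cache nil results, and a zero-arg call would pass
;; a nil key, which ConcurrentHashMap rejects.
(defn cmemoize [f]
  (let [cache   (java.util.concurrent.ConcurrentHashMap.)
        compute (reify java.util.function.Function
                  (apply [_ args] (apply f args)))]
    (fn [& args]
      ;; computeIfAbsent: lock-free reads on hits, computes at most
      ;; once per key under contention
      (.computeIfAbsent cache args compute))))

;; A fixed-arity variant skips building the args seq entirely, which is
;; where the "discrete number of args" speedup comes from.
(defn cmemoize1 [f]
  (let [cache   (java.util.concurrent.ConcurrentHashMap.)
        compute (reify java.util.function.Function
                  (apply [_ x] (f x)))]
    (fn [x]
      (.computeIfAbsent cache x compute))))
```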
Will do, expect it in the coming week :) And now I realize where I recognized your name from. I was really surprised to see … Finally, regarding memoize, yes, and every place where … I'll also have a look at the other repo you linked, seems interesting. Thanks.
Brief update: looks like your suggestion is faster under certain conditions, if not all. I'm reworking the benchmarks ns to get higher-quality results and analysis; then it'll get into the next release.
Added to latest version :)
I just noticed in recent testing that the original code I submitted didn't invoke transient and rmerge! correctly for the 2-arity version. No problem though, because your implementation does! Good job :)
Just ran into the myopic realization that I assumed the input supports IKVReduce without a fallback..

```clojure
(defn rmerge! [l r]
  (if (instance? clojure.lang.IKVReduce l)
    (.kvreduce ^clojure.lang.IKVReduce l
               (fn [^clojure.lang.ITransientAssociative acc k v]
                 (if-not (acc k)
                   (.assoc acc k v)
                   acc))
               r)
    (clojure.core.protocols/kv-reduce l
                                      (fn [^clojure.lang.ITransientAssociative acc k v]
                                        (if-not (acc k)
                                          (.assoc acc k v)
                                          acc))
                                      r)))
```

This version will do a (meager) instance check to see if we can use the fast path, and fall back to the protocol invocation (e.g. for user-defined types) otherwise.
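A quick usage check (my own illustration, assuming the rmerge! above): since the reduce folds l into the transient r and only adds keys r lacks, the right map's entries win, which lines up with clojure.core/merge when called as follows:

```clojure
;; hypothetical REPL check of rmerge! semantics
(persistent! (rmerge! {:a 1 :b 1} (transient {:a 2})))
;; => {:a 2, :b 1}, same as (merge {:a 1 :b 1} {:a 2})
```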
Good catch. I reopened the issue and will close it again once I add it. Thank you :)
Very curious to see …
From a cursory examination of your benchmarks, your tests are generating random maps via … Could be that a better (specific) name for the reverse-optimized merge operation would be …
Caught another possible issue with your map generation code, dunno if it matters necessarily (likely only matters for the int? case). When you generate random maps via randmap, duplicate generated keys collapse, so the actual map counts come in under the requested size:

```clojure
clj-fast.bench> (frequencies (map count (repeatedly 1000 #(randmap int? 1000))))
{789 12, 765 17, 798 1, 779 28, 774 38, 769 26, 799 2, 756 11, 806 1,
 770 42, 777 33, 788 9, 773 45, 752 3, 797 4, 763 20, 805 1, 743 1,
 776 36, 755 15, 758 18, 759 23, 749 6, 781 18, 796 7, 785 20, 782 21,
 792 6, 761 23, 794 4, 790 4, 784 15, 800 1, 739 1, 768 40, 778 27,
 775 46, 760 19, 803 1, 783 23, 772 46, 757 13, 786 8, 787 14, 771 32,
 751 7, 802 2, 793 4, 746 5, 753 9, 766 26, 795 4, 762 27, 747 3,
 767 43, 736 1, 748 4, 750 4, 780 34, 754 11, 791 3, 744 3, 764 29}
```

Maps requested at width 1000 actually land between ~736 and ~806 keys.
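The shrinkage is plain key collision; a tiny self-contained illustration (not the project's generator) shows the same effect:

```clojure
;; n randomly drawn keys collapse under map construction,
;; so the resulting map has fewer than n entries
(count (zipmap (repeatedly 1000 #(rand-int 10000)) (range)))
;; => roughly 950 on a typical run, not 1000
```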
The existing implementation of tmerge in clj-fast was inefficient. Removed superfluous instance checks and a let binding; that gets the performance down to fast-map-merge equivalence.
Added a test to the benchmarks with parameter-driven proportional key sharing for the maps (e.g. 10%, up to 100%, of the key width is shared across the random maps). Baseline profiling with 50% shared keys shows tmerge dominating, as expected. Will submit a PR for analysis in a bit.
Lol, now getting counterintuitive results in cases where pruning existing keys should obviously dominate relative to the tradeoff of naively assoc'ing everything. Dissecting and profiling.
I also started reworking the map generation code. I don't really care about the maps being random except when testing merge; then your observation applies.

```clojure
(defonce -s "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
(defonce -strs (for [a -s b -s c -s d -s e -s] (str a b c d e)))
(defonce -kws (map keyword -strs))
(defonce -syms (map symbol -strs))
(defonce r (range))

(def gens
  {:keyword? (fn [n] (take n -kws))
   :map?     (fn [n] (drop 1 (gen/sample (s/gen map?) (inc n))))
   :string?  (fn [n] (take n -strs))
   :int?     (fn [n] (take n r))
   :integer? (fn [n] (take n r))
   :symbol?  (fn [n] (take n -syms))})

(defn genn!
  [n spec]
  (drop 1 (gen/sample (s/gen spec) (inc n))))

(defn genn
  [n p]
  ((gens p) n))

(defn randmap!
  ([n]
   (randmap! keyword? n))
  ([p n]
   (into {} (genn! n (s/tuple p p)))))

(defn randmap
  ([p n]
   (let [coll (genn n p)]
     (zipmap coll coll))))

(defonce mrandmap (memoize randmap))

(declare mrand-nested-map)

(defn rand-nested-map
  [p width depth]
  (if (= 1 depth)
    (mrandmap p width)
    (zipmap (genn width p)
            (repeat width (mrand-nested-map p width (dec depth))))))

(defonce mrand-nested-map (memoize rand-nested-map))

(def preds identity)

(def preds!
  {:int?     int?
   :keyword? keyword?
   :string?  string?
   :map?     map?
   :symbol?  symbol?})
```
Another option is, instead of generating a bounded number, to generate an unbounded sequence, then dedupe and take n from it.
@bsless Nice. For reference, here's what I did for the shared map case...

```clojure
(defn shared-random-maps [proportion n p width]
  (let [shared-width      (long (* proportion width))
        variable-width    (- width shared-width)
        shared-map        (randmap p shared-width)
        random-shared-map (fn []
                            (zipmap (keys shared-map)
                                    (genn shared-width p)))]
    (repeatedly n #(merge (randmap p variable-width) (random-shared-map)))))
```
```clojure
(defn bench-merge-shared
  [proportion max-log-size nmaps]
  (vec
   (for [e (range 1 (inc max-log-size))
         n (range 1 (inc nmaps))
         pk *types*
         :let [width (int (Math/pow 10 e))
               p     (preds pk)
               ms    (shared-random-maps proportion n p width)]
         k [:merge :inline-merge :inline-fast-map-merge :inline-tmerge]
         :let [f     (merge-fns k)
               _     (println 'BENCH k 'WIDTH 10 'e e '* n 'TYPE pk)
               res   (apply f n ms)
               mn    (->ns (mean res))
               ratio (int (/ mn n))]]
     {:bench k
      :mean mn
      :width e
      :proportion-shared proportion
      :type pk
      :ratio ratio
      :keys n
      :heap @max-memory
      :gc @gcs})))
```

I also refactored the rmerge! implementation to hoist out the helper fn:

```clojure
(defn fast-assoc! [^clojure.lang.ITransientAssociative acc k v]
  (if-not (acc k)
    (.assoc acc k v)
    acc))

(defn rmerge!
  "Returns a transient map that consists of the second of the maps assoc-ed
  onto the first. If a key occurs in more than one map, the mapping from
  the latter (left-to-right) will be the mapping in the result."
  [l r]
  (if (instance? clojure.lang.IKVReduce l)
    (.kvreduce ^clojure.lang.IKVReduce l fast-assoc! r)
    (p/kv-reduce l fast-assoc! r)))
```

I backed off when I started checking test cases to investigate specific ones that were counterintuitive. There shouldn't really be a situation (in my mind) where the maps have any similarity, where …

```clojure
clj-fast.bench> (let [[m1 m2 m3 m4] (take 4 (repeat (zipmap (range 100) (range 100))))]
                  (criterium.core/quick-bench (inline/tmerge m1 m2 m3 m4)))
Evaluation count : 44772 in 6 samples of 7462 calls.
             Execution time mean : 13.524963 µs
    Execution time std-deviation : 260.329780 ns
   Execution time lower quantile : 13.239612 µs ( 2.5%)
   Execution time upper quantile : 13.762569 µs (97.5%)
                   Overhead used : 1.824294 ns
nil
clj-fast.bench> (let [[m1 m2 m3 m4] (take 4 (repeat (zipmap (range 100) (range 100))))]
                  (criterium.core/quick-bench (inline/fast-map-merge m1 m2 m3 m4)))
Evaluation count : 36882 in 6 samples of 6147 calls.
             Execution time mean : 15.978887 µs
    Execution time std-deviation : 277.302647 ns
   Execution time lower quantile : 15.615046 µs ( 2.5%)
   Execution time upper quantile : 16.290658 µs (97.5%)
                   Overhead used : 1.824294 ns
nil
```

In this case, the …
There are probably a ton of subtleties we need to be aware of. A few come to mind: …
Yes, those are all good observations. There's definitely a threshold where array maps dominate (very fast copy and lookup). Random observation: looks like keyword-keyed maps are about 85% faster on single-key lookup than numeric ones (maybe due to non-identity-based eq for numbers). Fascinating.
The contains vs. function-invocation one is appropriate. Another option is entryAt, maybe.
Since if you're doing …
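For context on the entryAt idea, a sketch of my own (not thread code): the (acc k) check in fast-assoc! can't distinguish a missing key from a key mapped to nil or false, while clojure.lang.ITransientAssociative2 (Clojure 1.9+) exposes a real presence check on transients:

```clojure
;; hedged sketch: a nil-safe variant of fast-assoc! using
;; ITransientAssociative2's containsKey (fast-assoc2! is a made-up name)
(defn fast-assoc2! [^clojure.lang.ITransientAssociative2 acc k v]
  (if (.containsKey acc k)
    acc                ; key already present, even if its value is nil/false
    (.assoc acc k v)))
```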
I'm getting apparently stochastic JITing during the benchmarks too. The first time I run them I get counterintuitive results; the second time, the dominance flips toward the expected. Very weird, since I figured the sampling from criterium drives a decent warmup period.
There are also some obvious bailout cases, like single-arity merge, where it doesn't make sense to transient (or reduce) at all; see the sketch below.
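A sketch of what such a bailout could look like (merge* is a hypothetical name; rmerge! as defined earlier in the thread):

```clojure
(defn merge*
  ([] nil)   ; clojure.core/merge also returns nil for no args
  ([m] m)    ; bailout: single map, no transient/persistent! round trip
  ([m1 m2] (persistent! (rmerge! m1 (transient m2))))
  ([m1 m2 & ms]
   (let [maps (list* m1 m2 ms)]
     ;; start from the rightmost map (its keys win) and fold the rest in
     (persistent!
      (reduce (fn [acc m] (rmerge! m acc))
              (transient (last maps))
              (rest (reverse maps)))))))
```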
Fascinating, looks like …
How exactly are you benching? At this point I mistrust anything that wasn't run from a direct-linked uberjar with a full bench (not quick-bench; the JIT isn't warmed accurately enough).
Just for reference, when I run the entire test suite I just leave it overnight. Takes a good while.
I've been trimming down to just the comparative runs for fast-map-merge and tmerge to get a better feedback loop. I also committed the sin of having the browser open (big no-no), so fine-grained measurements can be problematic.
Looks like there's a ~13% penalty for going through …
Sorry, I missed this message. I'm going two different ways. First is using the bench function API from the REPL, just for the merge bench (and the new shared-map one I pasted). I lifted the merge types into a dynvar so I can control which ones are run (I only care about the two right now). I'm also only doing the 10e1 case, but all 4 map calls. Then I'm doing similar focused measures using criterium on fixed maps instead of random ones, to get a better sense of what's going on. I have the browser open, which from past experience is definitely a perf hog. So this is all "runtime" benchmarking, and subject to known problems like the JIT finding a local optimum and other stuff.
Do you know of a way to profile the JIT itself on HotSpot? I know some proprietary implementations come with tools that allow dumping the generated assembly and analyzing the compiled code paths.
HotSpot is magic as far as I'm concerned. I went down that rabbit hole during the ICFPC code-optimization stuff and learned how much I didn't know regarding inlining and profiling and all manner of stuff. I wonder if the PGO stuff from Graal has any insight; you're supposed to be able to capture a dump of the profiling and use it for native-image to get fast native code, although it's enterprise-only.
Ever played with this tool? I have no experience with it, but it was the first result on your ubiquitous search engine: https://github.com/AdoptOpenJDK/jitwatch
I have not, but it looks useful.
One thing that already arises from the documentation, and gives me some cause for concern regarding my entire project, is that all the loop unrolling I'm doing will make methods larger and thus harder to inline. Wondering if I'm shooting myself in the foot with this.
I observed that with some stuff where I was going macro-heavy.
As they say, small methods (functions, for us) are aggressively inlined, so your mileage may vary. OTOH, I don't have an idea of the performance tradeoff of manually inlining and thereby bypassing the JIT. Perhaps there are cases where it is the better strategy.
It doesn't get simpler. Changed the generator to:

```clojure
(defn genn!
  [n spec]
  (sequence
   (comp
    (drop 1)
    (distinct)
    (take n))
   (clojure.test.check.generators/sample-seq (s/gen spec))))

(defn randmap!
  ([n]
   (randmap! keyword? n))
  ([p n]
   (into {} (genn! n (s/tuple p p)))))
```

Now I'm getting …
Coming up with a consistent performance model is surprisingly difficult. I got sidetracked with some work stuff and with helping with perf in another library, but that's settled down. Going to look at this again with fresh eyes.
Now I changed the input generator for merge to:

```clojure
(defn randmap!
  ([n]
   (randmap! keyword? n))
  ([p n]
   (let [s (genn! n p)]
     (zipmap s s))))
```

and …
It would be nice to have a couple of reproducible cases to examine. I don't think into vs. zipmap should affect results.
It's not into vs. zipmap, it's …
Pushed a branch with the changes to have a more convenient reference: …
I think with the latest release I managed to squeeze even more performance out of merge. Two maps of one element now merge in ~25 ns.
Another way to merge, one which is aware of type conversions:

```clojure
(defn merge-meh
  [m1 m2]
  ;; pour both maps into a transient *hash* map, so we never pay for an
  ;; array-map -> hash-map transition mid-merge
  (persistent!
   (reduce-kv
    assoc!
    (reduce-kv
     assoc!
     (transient clojure.lang.PersistentHashMap/EMPTY)
     m1)
    m2)))

(defn merge-ok
  [m1 m2]
  ;; reuse m1's structure directly; cheap when m1 stays on one side of
  ;; the array-map/hash-map boundary
  (persistent!
   (reduce-kv
    assoc!
    (transient m1)
    m2)))

(defn merge2
  [m1 m2]
  (let [c1 (count m1)]
    (if (> c1 8)                                ; m1 is already a hash map
      (merge-ok m1 m2)
      (if (< 8 (unchecked-add c1 (count m2)))   ; result will likely outgrow an array map
        (merge-meh m1 m2)
        (merge-ok m1 m2)))))
```
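The 8 in merge2 matches the array-map to hash-map boundary; a quick hedged check (behavior on recent Clojure versions):

```clojure
(type (zipmap (range 8) (range 8))) ;; => clojure.lang.PersistentArrayMap
(type (zipmap (range 9) (range 9))) ;; => clojure.lang.PersistentHashMap
```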
Turns out that when merging small maps (array maps) that need to become a hash map, this transition is quite costly in terms of performance:

```clojure
(c/quick-bench (merge {:a 1 :b 6}
                      {:a 1 :b 2 :c 3 :d 4 :e 5 :f 6 :g 8 :o 9 :l 10 :v 11}))
Evaluation count : 614460 in 6 samples of 102410 calls.
             Execution time mean : 974.772023 ns
    Execution time std-deviation : 2.971197 ns
   Execution time lower quantile : 970.592882 ns ( 2.5%)
   Execution time upper quantile : 977.600861 ns (97.5%)
                   Overhead used : 2.144573 ns
```

Whereas if the left map used as the source is already a PersistentHashMap, merging into it goes quite a bit faster:

```clojure
(c/quick-bench (merge {:a 1 :b 2 :c 3 :d 4 :e 5 :f 6 :g 8 :o 9 :l 10 :v 11}
                      {:a 1 :b 6}))
Evaluation count : 4681524 in 6 samples of 780254 calls.
             Execution time mean : 138.013267 ns
    Execution time std-deviation : 15.762978 ns
   Execution time lower quantile : 126.502904 ns ( 2.5%)
   Execution time upper quantile : 159.704613 ns (97.5%)
                   Overhead used : 2.144573 ns
```

Taking some of the examples from this issue, I've written a merge that only accepts 2 maps and uses the bigger map as the "source", iterating over the smaller one. So in the case where you have a PersistentHashMap and a PersistentArrayMap, it will use the hash map as the source. There's still no way to avoid the slow-ish moment when two array maps create a hash map.

```clojure
(defn- rassoc! [^clojure.lang.ITransientAssociative acc k v]
  (if-not (acc k)
    (.assoc acc k v)
    acc))

(defn- lassoc! [^clojure.lang.ITransientAssociative acc k v]
  (.assoc acc k v))

(defn- rmerge!
  [l r]
  (if (instance? clojure.lang.IKVReduce l)
    (.kvreduce ^clojure.lang.IKVReduce l rassoc! r)
    (clojure.core.protocols/kv-reduce l rassoc! r)))

(defn- lmerge!
  [l r]
  (if (instance? clojure.lang.IKVReduce r)
    (.kvreduce ^clojure.lang.IKVReduce r lassoc! l)
    (clojure.core.protocols/kv-reduce r lassoc! l)))

(defn merge!
  "Merges two maps, slightly faster than clojure.core/merge.
  It will always use as 'source' the bigger map, leaving the small map as
  the traversed one. It will be especially efficient if the second map is
  a hash-map and the first is an array-map, as the first map won't be 'merged'
  into the second, avoiding the transition from array-map to hash-map.
  m2 will always override keys in m1, same as clojure.core/merge"
  [m1 m2]
  (with-meta
    (if (< (count m1) (count m2))
      (persistent! (rmerge! m1 (transient m2)))
      (persistent! (lmerge! (transient m1) m2)))
    (meta m1)))
```

edit: forgot the merge! fn, and better formatting
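A small usage check of my own (not from the comment): whichever side ends up traversed, m2's entries still win:

```clojure
(merge! {:a 1 :b 2} {:a 9})   ;; m1 bigger: lmerge! path => {:a 9, :b 2}
(merge! {:a 1} {:a 9 :b 2})   ;; m2 bigger: rmerge! path => {:a 9, :b 2}
```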
@bsless I read that reduce-kv is getting some love, btw, to be more general in Clojure 1.11. Might be the best option. It could also alter some of the type-specific benchmarks. @bortex-io Very nice. Good observation about the type coercion costing us.
@joinr we found it out while teasing out the implementation on Slack, but I never see you there :)
I was independently working on something similar, albeit with a focus on improving clojure.core/merge performance. Watching Tommi's talk reminded me of the slowness (and I end up using it in a lot of legacy code on hot paths...). I tried just optimizing the persistent-map-based variant, which got me into the same ballpark you referenced (~30%). Going with transients and a useful heuristic yields better fruit, though (~50%)...
I did some testing to eliminate sources of slowdown: …