Replies: 5 comments 36 replies
-
Thanks for sharing. I will test an EPYC 9755 with 768 GB of DDR5-6800 ECC RAM and report back later.
-
I am testing the exact setup you are suggesting (AMD EPYC 9654 with 1.5 TB of RAM, all memory channels populated), but in a dual-socket configuration with the 671B Q8 model, and I have been getting 4-5 tokens/s. I am going to test the single-socket setup to see if there is any speed increase.
-
2x AMD EPYC 7K62 (96 cores total), 16x 64 GB RAM, DeepSeek-R1-Q5_K_S.gguf (671B, 461.81 GB), 96 threads -> 2.9 t/s

For the system above, try disabling NUMA in the system BIOS and let us know your CPU-only inference results. I have a dual-CPU system as well, and disabling NUMA in the BIOS increased my token output.
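If rebooting into the BIOS to toggle NUMA (node interleaving) is inconvenient, a software-side experiment can give a first hint. This is only a sketch: it assumes a Linux host with `numactl` installed and a llama.cpp build whose CLI binary is named `llama-cli`; the model path and thread count are placeholders.

```sh
# Inspect the NUMA layout the kernel currently sees (node count, per-node memory).
numactl --hardware
lscpu | grep -i numa

# Optional: disable automatic NUMA balancing and drop page caches so the model
# is not already cached on one node (needs root; revert the setting afterwards).
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
echo 3 | sudo tee /proc/sys/vm/drop_caches

# llama.cpp has a --numa option (distribute / isolate / numactl) that changes how
# threads and allocations are spread across nodes -- worth comparing against the
# BIOS toggle. Paths, prompt, and thread count below are placeholders.
./llama-cli -m /models/DeepSeek-R1-Q5_K_S.gguf -t 96 --numa distribute \
    -p "Hello" -n 64
```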
-
With a very large 671B-parameter model, my token output increased from 2 t/s to 3 t/s with NUMA disabled in the system BIOS. (NUMA = Non-Uniform Memory Access.)
-
@jasonsi1993 I'm still investigating this problem. My current hypothesis is that multiplication of small matrices (the expert tensor matrices are only 2048 x 7168) scales very badly on dual-CPU systems. To verify this, could someone run the steps below on a dual-CPU system? The model to try is llama-3.2 1B, as its FFN matrices are of similar size (2048 x 8192) to the DeepSeek R1 experts. If I'm right, it will scale just as badly as DeepSeek R1.
Please post the output in the replies (and thanks!).
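(The exact steps from the original comment are not quoted above. As a stand-in, a thread-scaling sweep with `llama-bench` along these lines should show whether throughput stops improving once threads span both sockets; the model path is a placeholder and the binary name assumes a recent llama.cpp build.)

```sh
# Sweep thread counts on a dual-socket machine: if small-matrix matmuls scale
# badly across sockets, t/s should plateau (or drop) once -t exceeds one socket.
./llama-bench -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf -t 24,48,96 -p 512 -n 128

# Same sweep pinned to a single socket (node 0 only) for comparison.
numactl --cpunodebind=0 --membind=0 \
    ./llama-bench -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf -t 24,48 -p 512 -n 128
```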
-
Could someone help figure out the best hardware configuration for CPU-only LLM inference?
I have done 3 tests:
I tested the same large model on different configurations and got the results above. That means llama.cpp is not optimized for dual-CPU-socket motherboards, and I cannot use the full power of such configurations to speed up LLM inference. It turned out that running a single instance of llama.cpp on one node (CPU) of a dual-CPU setup is far better than running it across both of them.
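For reference, the single-node case can be reproduced by pinning one llama.cpp instance to a single socket and its local memory. A minimal sketch, assuming `numactl` is available; the model path, thread count, and prompt are placeholders:

```sh
# Run entirely on socket 0: threads on node 0, allocations from node 0's RAM.
numactl --cpunodebind=0 --membind=0 \
    ./llama-cli -m /models/DeepSeek-R1-Q5_K_S.gguf -t 48 -p "Hello" -n 64
```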
A lot of different optimizations did not give any significant inference boost. So based on the above, for the best t/s when running an LLM such as DeepSeek-R1-Q5_K_S.gguf (671B, 461.81 GB), I suggest the following hardware configuration:
With this setup I am optimistically expecting something around 10 t/s inference speed for the same big model, DeepSeek-R1-Q5_K_S.gguf (671B, 461.81 GB). Could someone correct me if I'm wrong, or maybe suggest your own ideas and thoughts?
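As a sanity check on that number: CPU decoding is usually memory-bandwidth-bound, so a rough ceiling is usable memory bandwidth divided by bytes read per token. The figures below are assumptions, not measurements: DeepSeek R1 activates roughly 37B of its 671B parameters per token, Q5_K_S works out to about 0.69 bytes per weight (461.81 GB / 671B), and a 12-channel DDR5-4800 socket is assumed at roughly 460 GB/s theoretical bandwidth.

```sh
# Bandwidth-bound ceiling; all inputs are rough assumptions (see above):
#   bytes per token ~= 37e9 active params * 0.69 bytes/param ~= 25.5 GB
#   ceiling         ~= 460 GB/s / 25.5 GB                    ~= 18 t/s
# Real decode rates land well below the theoretical ceiling, so ~10 t/s is plausible.
echo "scale=1; 460 / (37 * 0.69)" | bc
```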