Replies: 5 comments 36 replies
-
Thanks for sharing. I will test an EPYC 9755 with 768 GB of DDR5-6800 ECC RAM and report back later.
-
I am testing the exact setup you are suggesting (AMD EPYC 9654 with 1.5 TB of RAM, all memory channels populated), but in a dual-socket configuration with the 671B Q8 model, and I have been getting 4-5 tokens/s. I am going to test the single-socket setup to see if there is any speed increase.
-
2x AMD EPYC 7K62 (96 cores total), 16x 64 GB RAM, DeepSeek-R1-Q5_K_S.gguf (671B, 461.81 GB), 96 threads -> 2.9 t/s

For the system above, try disabling NUMA in the system BIOS and let us know your CPU-only inference results. I have a dual-CPU system as well, and disabling NUMA in the BIOS increased my token output.
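If rebooting into the BIOS to toggle NUMA (node interleaving) is inconvenient, a software-side experiment can give a first hint. This is only a sketch: it assumes a Linux host with `numactl` installed and a llama.cpp build whose CLI binary is named `llama-cli`; the model path and thread count are placeholders.

```sh
# Inspect the NUMA layout the kernel currently sees (node count, per-node memory).
numactl --hardware
lscpu | grep -i numa

# Optional: disable automatic NUMA balancing and drop page caches so the model
# is not already cached on one node (needs root; revert the setting afterwards).
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
echo 3 | sudo tee /proc/sys/vm/drop_caches

# llama.cpp has a --numa option (distribute / isolate / numactl) that changes how
# threads and allocations are spread across nodes -- worth comparing against the
# BIOS toggle. Paths, prompt, and thread count below are placeholders.
./llama-cli -m /models/DeepSeek-R1-Q5_K_S.gguf -t 96 --numa distribute \
    -p "Hello" -n 64
```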
-
With a very large 671B-parameter model, my token output increased from 2 t/s to 3 t/s with NUMA disabled in the system BIOS. (NUMA = Non-Uniform Memory Access.)
-
@jasonsi1993 I'm still investigating this problem. My current hypothesis is that multiplication of small matrices (the expert tensor matrices are only 2048 x 7168) scales very badly on dual-CPU systems. To verify this, could someone run the steps below on a dual-CPU system? The model to try is llama-3.2 1B, as its FFN matrices are of similar size (2048 x 8192) to the DeepSeek R1 experts. If I'm right, it will scale just as badly as DeepSeek R1.
Please post the output in the replies (and thanks!).
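(The exact steps from the original comment are not quoted above. As a stand-in, a thread-scaling sweep with `llama-bench` along these lines should show whether throughput stops improving once threads span both sockets; the model path is a placeholder and the binary name assumes a recent llama.cpp build.)

```sh
# Sweep thread counts on a dual-socket machine: if small-matrix matmuls scale
# badly across sockets, t/s should plateau (or drop) once -t exceeds one socket.
./llama-bench -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf -t 24,48,96 -p 512 -n 128

# Same sweep pinned to a single socket (node 0 only) for comparison.
numactl --cpunodebind=0 --membind=0 \
    ./llama-bench -m /models/Llama-3.2-1B-Instruct-Q8_0.gguf -t 24,48 -p 512 -n 128
```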
-
Could someone help figure out the best hardware configuration for CPU-only LLM inference?
I have done 3 tests:
I tested the same large model on different configurations and got the results above. That means llama.cpp is not optimized for dual-CPU-socket motherboards, and I cannot use the full power of such configurations to speed up LLM inference. It turned out that running a single instance of llama.cpp on one node (CPU) of a dual-CPU setup is far better than running it across both of them.
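For reference, the single-node case can be reproduced by pinning one llama.cpp instance to a single socket and its local memory. A minimal sketch, assuming `numactl` is available; the model path, thread count, and prompt are placeholders:

```sh
# Run entirely on socket 0: threads on node 0, allocations from node 0's RAM.
numactl --cpunodebind=0 --membind=0 \
    ./llama-cli -m /models/DeepSeek-R1-Q5_K_S.gguf -t 48 -p "Hello" -n 64
```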
A lot of different optimizations did not give any significant inference boost. So based on the above, for the best t/s when running an LLM such as DeepSeek-R1-Q5_K_S.gguf (671B, 461.81 GB), I suggest the following hardware configuration:
With this setup I am optimistically expecting something around 10 t/s inference speed for the same big model, DeepSeek-R1-Q5_K_S.gguf (671B, 461.81 GB). Could someone correct me if I'm wrong, or maybe suggest your own ideas and thoughts?
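As a sanity check on that number: CPU decoding is usually memory-bandwidth-bound, so a rough ceiling is usable memory bandwidth divided by bytes read per token. The figures below are assumptions, not measurements: DeepSeek R1 activates roughly 37B of its 671B parameters per token, Q5_K_S works out to about 0.69 bytes per weight (461.81 GB / 671B), and a 12-channel DDR5-4800 socket is assumed at roughly 460 GB/s theoretical bandwidth.

```sh
# Bandwidth-bound ceiling; all inputs are rough assumptions (see above):
#   bytes per token ~= 37e9 active params * 0.69 bytes/param ~= 25.5 GB
#   ceiling         ~= 460 GB/s / 25.5 GB                    ~= 18 t/s
# Real decode rates land well below the theoretical ceiling, so ~10 t/s is plausible.
echo "scale=1; 460 / (37 * 0.69)" | bc
```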