DeepSeek V3 Support #760

Open
casper-hansen opened this issue Dec 26, 2024 · 5 comments
Labels: enhancement (New feature or request)

Comments

casper-hansen (Contributor) commented Dec 26, 2024

@tianyu-l Support for DeepSeek-V3 would be excellent given its top-tier performance.

Main parallelism components:

  • 64-way expert parallelism
  • 16-way pipeline parallelism
  • ZeRO-1 data parallelism
  • Note: they do not apply tensor parallelism (TP). (A rough device-mesh sketch of this layout follows this list.)
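
For illustration, here is a minimal sketch of how those degrees could be laid out as a PyTorch device mesh. This is not TorchTitan's actual configuration; the dimension names and the ZeRO-1 wiring are my assumptions:

```python
# Rough sketch only: DeepSeek-V3's reported parallelism degrees
# (ZeRO-1 DP x 16-way PP x 64-way EP) expressed as a device mesh.
# Assumes torch.distributed is already initialized; names and sizes are illustrative.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

pp_degree = 16   # pipeline-parallel stages (from the report)
ep_degree = 64   # expert-parallel groups (from the report)
dp_degree = dist.get_world_size() // (pp_degree * ep_degree)  # ZeRO-1 data parallel

mesh = init_device_mesh(
    "cuda",
    (pp_degree, dp_degree, ep_degree),
    mesh_dim_names=("pp", "dp", "ep"),
)

# Sub-meshes would then feed the pipeline schedule, the ZeRO-1 optimizer,
# and the expert all-to-all dispatch, e.g. mesh["ep"].get_group().
```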

Other main modeling components:

  • multi-head latent attention (MLA); a rough sketch of the latent-KV idea follows this list
  • multi-token prediction with their MTP modules
  • mixed-precision training (mix of FP8, BF16, FP32)
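
To make the MLA item concrete, here is a rough sketch of the latent-KV idea in plain PyTorch. The dimensions are made up, and the decoupled RoPE keys and query compression from the report are omitted:

```python
# Rough sketch of multi-head latent attention (MLA): the KV cache would store a
# small latent c_kv instead of full per-head K/V.  Omits DeepSeek's decoupled
# RoPE keys and query compression; all dimensions here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    def __init__(self, dim=2048, n_heads=16, head_dim=128, kv_latent_dim=256):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.wq = nn.Linear(dim, n_heads * head_dim, bias=False)
        self.w_dkv = nn.Linear(dim, kv_latent_dim, bias=False)               # down-project to the KV latent
        self.w_uk = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False) # latent -> per-head keys
        self.w_uv = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False) # latent -> per-head values
        self.wo = nn.Linear(n_heads * head_dim, dim, bias=False)

    def forward(self, x):                                  # x: (batch, seq, dim)
        b, s, _ = x.shape
        q = self.wq(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        c_kv = self.w_dkv(x)                               # (batch, seq, kv_latent_dim): this is what gets cached
        k = self.w_uk(c_kv).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.w_uv(c_kv).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, s, -1))
```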

Model weights: https://huggingface.co/deepseek-ai/DeepSeek-V3
Paper link: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf

Performance: [benchmark results image from the technical report]

@tianyu-l added the enhancement label Dec 27, 2024
casper-hansen (Contributor, Author) commented

@tianyu-l Given the performance of this specific model and the recent boom in activity, can we reasonably expect TorchTitan to support this model?

I understand this model was not created by Meta, but I (along with others) would value support for efficient training of this model in TorchTitan.

tianyu-l (Contributor) commented

I agree we probably should prioritize supporting this model.

However, I feel that supporting all the training optimizations mentioned in the technical report could be heavy and/or may not be fully aligned with the purpose of torchtitan. Would it still be interesting if we supported the model and trained it "in our own way", e.g. using parallelisms / optimizations similar to what we do for Llama?

casper-hansen (Contributor, Author) commented

@tianyu-l I am mainly interested in a model architecture implementation. The remaining details, like FP8 training and the various forms of parallelism, are already implemented in TorchTitan and should be reused.

So it's mainly the following components that I am asking for:

  • MoE (a bare-bones routing sketch follows this list)
  • multi-head latent attention (MLA)
  • multi-token prediction heads
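
For reference, a bare-bones top-k routed MoE feed-forward block looks roughly like this. Sizes are illustrative, and DeepSeek-V3's shared experts, sigmoid gating, auxiliary-loss-free load balancing, and expert parallelism are left out:

```python
# Rough sketch of a top-k routed MoE feed-forward block (the layer type being
# requested).  Illustrative sizes; not DeepSeek-V3's actual routing scheme.
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, dim=2048, hidden=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                                    # x: (num_tokens, dim), batch*seq flattened
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            tok, slot = (idx == e).nonzero(as_tuple=True)    # tokens routed to expert e, and their top-k slot
            if tok.numel():
                out[tok] += weights[tok, slot].unsqueeze(-1) * expert(x[tok])
        return out
```
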

A good starting point would be the ability to convert the weights provided on Hugging Face into TorchTitan's format and continue training from them; a rough sketch of that conversion step is below.
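
As a rough sketch of that conversion plumbing only: the key-renaming rule is a placeholder, and a real conversion would also have to handle the FP8-quantized shards and avoid building one giant in-memory dict.

```python
# Rough sketch: pull the published safetensors shards and rename keys into
# whatever a TorchTitan DeepSeek-V3 module would expect.  The rename rule is a
# placeholder; the real mapping (attention, experts, MTP heads) has to be
# worked out against the actual checkpoint.
from pathlib import Path

import torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file

ckpt_dir = Path(snapshot_download("deepseek-ai/DeepSeek-V3", allow_patterns=["*.safetensors"]))

def rename(hf_key: str) -> str:
    # Placeholder mapping, not the real key names on either side.
    return hf_key.replace("model.", "")

state_dict = {}
for shard in sorted(ckpt_dir.glob("*.safetensors")):
    for key, tensor in load_file(shard).items():
        state_dict[rename(key)] = tensor

torch.save(state_dict, "deepseek_v3_torchtitan.pt")  # then load into the TorchTitan model and resume training
```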

lxww302 commented Jan 27, 2025

Is the MoE architecture already supported in this branch? #730

lessw2020 (Contributor) commented

Hi @lxww302 - that branch is working for tp2ep. I hit some issues with dp2ep, so I would not use that yet, but tp2ep worked perfectly in my brief spin with it.
