DeepSeek V3 Support #760

Open
casper-hansen opened this issue Dec 26, 2024 · 5 comments
Labels: enhancement (New feature or request)

Comments

casper-hansen (Contributor) commented Dec 26, 2024

@tianyu-l Support for DeepSeek-V3 would be excellent given its top-tier performance.

Main parallelism components:

  • 64-way expert parallelism
  • 16-way pipeline parallelism
  • ZeRO-1 data parallelism
  • Note: they do not apply tensor parallelism (TP). (A rough device-mesh sketch of this layout follows this list.)
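
For illustration, here is a minimal sketch of how those degrees could be laid out as a PyTorch device mesh. This is not TorchTitan's actual configuration; the dimension names and the ZeRO-1 wiring are my assumptions:

```python
# Rough sketch only: DeepSeek-V3's reported parallelism degrees
# (ZeRO-1 DP x 16-way PP x 64-way EP) expressed as a device mesh.
# Assumes torch.distributed is already initialized; names and sizes are illustrative.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

pp_degree = 16   # pipeline-parallel stages (from the report)
ep_degree = 64   # expert-parallel groups (from the report)
dp_degree = dist.get_world_size() // (pp_degree * ep_degree)  # ZeRO-1 data parallel

mesh = init_device_mesh(
    "cuda",
    (pp_degree, dp_degree, ep_degree),
    mesh_dim_names=("pp", "dp", "ep"),
)

# Sub-meshes would then feed the pipeline schedule, the ZeRO-1 optimizer,
# and the expert all-to-all dispatch, e.g. mesh["ep"].get_group().
```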

Other main modeling components:

  • multi-head latent attention (MLA); a rough sketch of the latent-KV idea follows this list
  • multi-token prediction with their MTP modules
  • mixed-precision training (mix of FP8, BF16, FP32)
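
To make the MLA item concrete, here is a rough sketch of the latent-KV idea in plain PyTorch. The dimensions are made up, and the decoupled RoPE keys and query compression from the report are omitted:

```python
# Rough sketch of multi-head latent attention (MLA): the KV cache would store a
# small latent c_kv instead of full per-head K/V.  Omits DeepSeek's decoupled
# RoPE keys and query compression; all dimensions here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    def __init__(self, dim=2048, n_heads=16, head_dim=128, kv_latent_dim=256):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.wq = nn.Linear(dim, n_heads * head_dim, bias=False)
        self.w_dkv = nn.Linear(dim, kv_latent_dim, bias=False)               # down-project to the KV latent
        self.w_uk = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False) # latent -> per-head keys
        self.w_uv = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False) # latent -> per-head values
        self.wo = nn.Linear(n_heads * head_dim, dim, bias=False)

    def forward(self, x):                                  # x: (batch, seq, dim)
        b, s, _ = x.shape
        q = self.wq(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        c_kv = self.w_dkv(x)                               # (batch, seq, kv_latent_dim): this is what gets cached
        k = self.w_uk(c_kv).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.w_uv(c_kv).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, s, -1))
```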

Model weights: https://huggingface.co/deepseek-ai/DeepSeek-V3
Paper link: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf

Performance: [benchmark results image from the technical report]

@tianyu-l added the enhancement label Dec 27, 2024
casper-hansen (Contributor, Author) commented

@tianyu-l Given the performance of this specific model and the recent boom in activity, can we reasonably expect TorchTitan to support this model?

I understand this model was not created by Meta, but I (along with others) would value support for efficient training of this model in TorchTitan.

tianyu-l (Contributor) commented

I agree we probably should prioritize supporting this model.

However, I feel that supporting all the training optimizations mentioned in the technical report could be heavy and/or may not be fully aligned with the purpose of torchtitan. Would it still be interesting if we supported the model and trained it "in our own way", e.g. using parallelisms / optimizations similar to what we do for Llama?

casper-hansen (Contributor, Author) commented

@tianyu-l I am mainly interested in a model architecture implementation. The remaining details, like FP8 training and the various forms of parallelism, are already implemented in TorchTitan and should be reused.

So it's mainly the following components that I am asking for:

  • MoE (a bare-bones routing sketch follows this list)
  • multi-head latent attention (MLA)
  • multi-token prediction heads
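
For reference, a bare-bones top-k routed MoE feed-forward block looks roughly like this. Sizes are illustrative, and DeepSeek-V3's shared experts, sigmoid gating, auxiliary-loss-free load balancing, and expert parallelism are left out:

```python
# Rough sketch of a top-k routed MoE feed-forward block (the layer type being
# requested).  Illustrative sizes; not DeepSeek-V3's actual routing scheme.
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, dim=2048, hidden=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                                    # x: (num_tokens, dim), batch*seq flattened
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            tok, slot = (idx == e).nonzero(as_tuple=True)    # tokens routed to expert e, and their top-k slot
            if tok.numel():
                out[tok] += weights[tok, slot].unsqueeze(-1) * expert(x[tok])
        return out
```
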

A good starting point would be the ability to convert the weights provided on Hugging Face into TorchTitan's format and continue training from them; a rough sketch of that conversion step is below.
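
As a rough sketch of that conversion plumbing only: the key-renaming rule is a placeholder, and a real conversion would also have to handle the FP8-quantized shards and avoid building one giant in-memory dict.

```python
# Rough sketch: pull the published safetensors shards and rename keys into
# whatever a TorchTitan DeepSeek-V3 module would expect.  The rename rule is a
# placeholder; the real mapping (attention, experts, MTP heads) has to be
# worked out against the actual checkpoint.
from pathlib import Path

import torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file

ckpt_dir = Path(snapshot_download("deepseek-ai/DeepSeek-V3", allow_patterns=["*.safetensors"]))

def rename(hf_key: str) -> str:
    # Placeholder mapping, not the real key names on either side.
    return hf_key.replace("model.", "")

state_dict = {}
for shard in sorted(ckpt_dir.glob("*.safetensors")):
    for key, tensor in load_file(shard).items():
        state_dict[rename(key)] = tensor

torch.save(state_dict, "deepseek_v3_torchtitan.pt")  # then load into the TorchTitan model and resume training
```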

lxww302 commented Jan 27, 2025

Is the MoE architecture already supported in this branch? #730

lessw2020 (Contributor) commented

Hi @lxww302 - that branch is working for tp2ep. I hit some issues with dp2ep, so I would not use that yet, but tp2ep worked perfectly in my brief spin with it.
