Add deepseek_v3 fused gate #3191
base: main
Conversation
# Your module under test
output, indices_my = deepseekv3_fused_gate(tensor, bias, seq_length)

###### Reference Implementation ######
Please refactor this code into a standalone function so that it can be used directly from https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/layers/moe/topk.py#L111-L147.
Do you mean I should separate the reference implementation into a standalone function?
Yeah.
Got it, I will do that.
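For reference, a minimal sketch of what such a standalone reference function could look like, following the biased grouped top-k routing described for DeepSeek V3. The function name biased_grouped_topk_ref and the default values for top_k, n_group, topk_group, and routed_scaling_factor are illustrative assumptions, not the PR's actual code:

import torch


def biased_grouped_topk_ref(
    scores: torch.Tensor,   # [num_tokens, num_experts], sigmoid gate outputs
    bias: torch.Tensor,     # [num_experts], score correction bias
    top_k: int = 8,
    n_group: int = 8,
    topk_group: int = 4,
    routed_scaling_factor: float = 2.5,
):
    num_tokens, num_experts = scores.shape
    scores_for_choice = scores + bias

    # Rank expert groups by the sum of their top-2 biased scores.
    group_scores = (
        scores_for_choice.view(num_tokens, n_group, -1).topk(2, dim=-1)[0].sum(dim=-1)
    )
    group_idx = group_scores.topk(topk_group, dim=-1)[1]

    # Mask out experts from unselected groups, then take the global top-k.
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1)
    score_mask = (
        group_mask.unsqueeze(-1)
        .expand(num_tokens, n_group, num_experts // n_group)
        .reshape(num_tokens, num_experts)
    )
    masked_scores = scores_for_choice.masked_fill(score_mask == 0, float("-inf"))
    topk_idx = masked_scores.topk(top_k, dim=-1)[1]

    # Final weights come from the unbiased scores, renormalized and rescaled.
    topk_weights = scores.gather(1, topk_idx)
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights * routed_scaling_factor, topk_idx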
output_ref = weights.type_as(scores)

# Assertions
output_check = torch.allclose(output_ref.sort()[0], output.sort()[0], rtol=1e-04, atol=1e-05)
Why not directly compare output and output_ref instead of sorting them?
This is weird: the kernel sometimes produces exactly the same output values, but in a different order. I checked the downstream steps and the output order does not matter there, so I used this approach for the unit test. Is this OK?
We need to determine at which specific step of the fused kernel this inconsistency in order occurs. Additionally, we need to clarify whether running the PyTorch implementation twice with the same input would result in inconsistent output orders. Finally, if you believe that the current order inconsistency does not affect the fused MoE accuracy, you need to provide an end-to-end result, such as running the GSM8K test with the DeepSeek V3 model.
I see, I will check the inconsistency inside the kernel. I cannot run the e2e test on my server; Yineng will help me run it.
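As a side note, one way to make the check independent of row-internal ordering without losing the pairing between weights and expert indices is to sort both results by expert index before comparing. A small sketch, not taken from the PR (the helper name assert_topk_equal is made up for illustration):

import torch


def assert_topk_equal(out, idx, out_ref, idx_ref, rtol=1e-4, atol=1e-5):
    # Sort each row by expert index so that ordering within a row no longer matters.
    order = idx.argsort(dim=-1)
    order_ref = idx_ref.argsort(dim=-1)
    # Indices must match exactly; weights must match up to tolerance.
    assert torch.equal(idx.gather(-1, order), idx_ref.gather(-1, order_ref))
    assert torch.allclose(
        out.gather(-1, order), out_ref.gather(-1, order_ref), rtol=rtol, atol=atol
    )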
from sgl_kernel import deepseekv3_fused_gate


@pytest.mark.parametrize("seq_length", range(1, 20000))
Can you add a benchmark script? Maybe refer to https://github.com/sgl-project/sglang/tree/main/sgl-kernel/benchmark
Sure
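A rough sketch of what such a benchmark could look like, modeled on the triton.testing.perf_report style used by the existing scripts in sgl-kernel/benchmark. The expert count of 256, the sequence lengths, and the plot name below are illustrative assumptions:

import torch
import triton
import triton.testing

from sgl_kernel import deepseekv3_fused_gate


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["seq_length"],
        x_vals=[64, 256, 1024, 4096, 16384],
        line_arg="provider",
        line_vals=["fused"],
        line_names=["deepseekv3_fused_gate"],
        ylabel="us",
        plot_name="deepseekv3-fused-gate",
        args={},
    )
)
def benchmark(seq_length, provider):
    # 256 routed experts, matching DeepSeek V3 (assumed shape for illustration).
    scores = torch.rand(seq_length, 256, device="cuda", dtype=torch.float32)
    bias = torch.rand(256, device="cuda", dtype=torch.float32)
    quantiles = [0.5, 0.2, 0.8]
    ms, min_ms, max_ms = triton.testing.do_bench(
        lambda: deepseekv3_fused_gate(scores, bias, seq_length),
        quantiles=quantiles,
    )
    return 1000 * ms, 1000 * max_ms, 1000 * min_ms


if __name__ == "__main__":
    benchmark.run(print_data=True)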
@@ -3,6 +3,7 @@
bmm_fp8,
custom_dispose,
custom_reduce,
deepseekv3_fused_gate,
It seems more appropriate to name it deepseek_fused_gate here, as models from the deepseek series can all go through this gate function.
This is not a generalized kernel; it only works for the DeepSeek V3 671B model.
I see, thanks.
I think it also works for DeepSeek V2 VL
input.data_ptr(), bias.data_ptr(), output.data_ptr(), indices.data_ptr<int64_t>(), num_rows, k, route_scale
);

CHECK_CUDA_SUCCESS(cudaDeviceSynchronize());
Synchronization is not allowed in the kernel's host-side code, as it will cause CUDA graph capture to crash. Can you remove it?
Thanks, I will update these.
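For context, a minimal Python-level illustration (not from this PR) of why a device-wide synchronize in the kernel's host code is a problem: any cudaDeviceSynchronize issued while a CUDA graph is being captured aborts the capture.

import torch

x = torch.randn(16, 256, device="cuda")
y = torch.empty_like(x)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # Kernels launched here are recorded into the graph instead of executed.
    y.copy_(x * 2)
    # A device-wide sync such as torch.cuda.synchronize() (cudaDeviceSynchronize
    # on the C++ side) at this point would make the capture fail.

g.replay()  # replays the recorded kernels later without any sync inside the capture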
@@ -0,0 +1,219 @@
#include <cfloat>
Please add a comment noting that this is adapted from https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu#L231.
In TensorRT-LLM, the fused MoE module, in addition to the …
Sounds good @BBuf
Yeah, I can have a try.
Add deepseek v3 fused gate module