Add deepseek_v3 fused gate #3191

Open · wants to merge 3 commits into base: main

Conversation

@NovTi commented Jan 28, 2025

Add deepseek v3 fused gate module

# Your module under test
output, indices_my = deepseekv3_fused_gate(tensor, bias, seq_length)

###### Reference Implementation ######
Collaborator

Please refactor this code into a standalone function, which can be directly used from https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/layers/moe/topk.py#L111-L147.
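
For reference, a minimal sketch of what such a standalone reference function might look like, assuming the DeepSeek V3 style sigmoid + bias-corrected grouped top-k routing described in the linked topk.py; the function name, parameter defaults, and shapes below are illustrative, not the actual sglang code:

```python
import torch

def reference_deepseekv3_gate(logits, bias, topk=8, num_groups=8, topk_groups=4,
                              routed_scaling_factor=2.5):
    # logits: [num_tokens, num_experts], bias: [num_experts]
    scores = logits.sigmoid()
    scores_for_choice = scores + bias            # bias only influences expert selection
    num_tokens, num_experts = scores.shape

    # score each expert group by the sum of its top-2 biased scores
    group_scores = (
        scores_for_choice.view(num_tokens, num_groups, -1)
        .topk(2, dim=-1).values.sum(dim=-1)
    )
    group_idx = group_scores.topk(topk_groups, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1)

    # mask out experts outside the selected groups, then pick the top-k experts
    score_mask = (
        group_mask.unsqueeze(-1)
        .expand(num_tokens, num_groups, num_experts // num_groups)
        .reshape(num_tokens, num_experts)
    )
    masked = scores_for_choice.masked_fill(score_mask == 0, float("-inf"))
    topk_idx = masked.topk(topk, dim=-1).indices

    # routing weights come from the unbiased scores, normalized and scaled
    topk_weights = scores.gather(1, topk_idx)
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights * routed_scaling_factor, topk_idx
```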

Author

Do you mean I should separate the reference implementation into a standalone function?

Collaborator

Yeah.

Author

Got it, I will do that

output_ref = weights.type_as(scores)

# Assertions
output_check = torch.allclose(output_ref.sort()[0], output.sort()[0], rtol=1e-04, atol=1e-05)
Collaborator

Why not directly compare output and output_ref instead of sorting them?

Author

This is weird: the kernel sometimes produces exactly the same outputs, just in a different order. I checked the downstream steps and the output order does not matter there, so I used this approach for the unit test. Is this OK?
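
One possible order-insensitive check that still pairs each weight with its expert index (a sketch only; it assumes output and indices have shape [seq_length, topk], and the indices_ref tensor for the reference implementation is hypothetical here):

```python
def sort_by_expert(weights, indices):
    # reorder each row by ascending expert id so both implementations align
    order = indices.argsort(dim=-1)
    return weights.gather(-1, order), indices.gather(-1, order)

w_my, i_my = sort_by_expert(output, indices_my)
w_ref, i_ref = sort_by_expert(output_ref, indices_ref)  # indices_ref: assumed reference indices
assert torch.equal(i_my, i_ref)
assert torch.allclose(w_my, w_ref, rtol=1e-4, atol=1e-5)
```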

Collaborator
@BBuf commented Jan 28, 2025

We need to determine at which specific step of the fused kernel this inconsistency in order occurs. Additionally, we need to clarify whether running the PyTorch implementation twice with the same input would result in inconsistent output orders. Finally, if you believe that the current order inconsistency does not affect the fused MoE accuracy, you need to provide an end-to-end result, such as running the GSM8K test with the DeepSeek V3 model.

(image attached)

Author

I see, I will check the inconsistency inside the kernel. I cannot run the e2e test on my server; Yineng will help me run it.

from sgl_kernel import deepseekv3_fused_gate


@pytest.mark.parametrize("seq_length", range(1, 20000))
Collaborator

Can you add a benchmark script? Maybe refer to https://github.com/sgl-project/sglang/tree/main/sgl-kernel/benchmark
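
A minimal benchmark sketch in the spirit of the scripts under sgl-kernel/benchmark, assuming triton.testing.do_bench is available; the 256-expert shape follows DeepSeek V3 but the dtype and sequence lengths are illustrative:

```python
import torch
import triton
from sgl_kernel import deepseekv3_fused_gate

def bench(seq_length, num_experts=256):
    tensor = torch.randn(seq_length, num_experts, dtype=torch.float32, device="cuda")
    bias = torch.randn(num_experts, dtype=torch.float32, device="cuda")
    # latency in milliseconds over repeated timed runs
    ms = triton.testing.do_bench(lambda: deepseekv3_fused_gate(tensor, bias, seq_length))
    print(f"seq_length={seq_length}: {ms * 1e3:.2f} us")

if __name__ == "__main__":
    for s in (1, 64, 1024, 8192):
        bench(s)
```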

Author

Sure

@@ -3,6 +3,7 @@
bmm_fp8,
custom_dispose,
custom_reduce,
deepseekv3_fused_gate,
Collaborator
@BBuf commented Jan 28, 2025

It seems more appropriate to name it deepseek_fused_gate here, as models from the deepseek series can all go through this gate function.

Author

This is not a generalized kernel; it only works for the DeepSeek V3 671B model.

Collaborator

I see, thanks.

Member

I think it also works for DeepSeek V2 VL

input.data_ptr(), bias.data_ptr(), output.data_ptr(), indices.data_ptr<int64_t>(), num_rows, k, route_scale
);

CHECK_CUDA_SUCCESS(cudaDeviceSynchronize());
Collaborator
@BBuf commented Jan 28, 2025

Synchronization is not allowed in the CUDA kernel's host code, as it will cause CUDA graph capture to fail. Can you remove it?
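
For context, a small sketch of how graph-capture compatibility could be checked from Python once the sync is removed (shapes and dtype are illustrative; capture fails if the kernel's host code calls cudaDeviceSynchronize):

```python
import torch
from sgl_kernel import deepseekv3_fused_gate

seq_length, num_experts = 128, 256
tensor = torch.randn(seq_length, num_experts, device="cuda")
bias = torch.randn(num_experts, device="cuda")

# warm up on a side stream, as recommended before CUDA graph capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    deepseekv3_fused_gate(tensor, bias, seq_length)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):  # raises if the captured region issues a device-wide sync
    out, idx = deepseekv3_fused_gate(tensor, bias, seq_length)
g.replay()
```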

Author

Thanks, I will update these

@BBuf changed the title from "Add deepseek fused gate" to "Add deepseek_v3 fused gate" on Jan 28, 2025
@@ -0,0 +1,219 @@
#include <cfloat>
Collaborator

Please add "Adapted from https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu#L231".

Collaborator
@BBuf commented Jan 29, 2025

In TensorRT-LLM, the fused MoE module includes not only the fused gate here but also trt_moe_expand_and_permute, the CUTLASS grouped GEMM, and trt_moe_unpermute_and_reduce. Compared to the MoE implemented in Triton, the advantage of TensorRT-LLM's approach is that it does not require padding, which saves some computational overhead, and the CUTLASS implementation may have greater performance potential, especially on the Hopper architecture.

I ran an experiment at https://github.com/sgl-project/sglang/tree/bbuf_tmp, where I connected trt_moe_expand_and_permute, trt_moe_unpermute_and_reduce, and FlashInfer's grouped GEMM in sgl-kernel and ran correctness comparisons against the Triton fused MoE operator for the bfloat16 dtype. However, the current performance is still significantly worse than Triton's, possibly due to performance issues with FlashInfer's grouped GEMM on specific shapes. Additionally, FlashInfer's GEMM does not currently support scaled FP8 or INT8 GEMM.

If anyone is interested, we can discuss whether to integrate TensorRT's fused MoE directly as a backend into sglang or to use FlashInfer's approach, which would require customizing FlashInfer's grouped GEMM. cc @zhyncs
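
A conceptual sketch of the unpadded permute, grouped GEMM, and unpermute-and-reduce flow described above, in plain PyTorch (not the TensorRT-LLM or FlashInfer API; names and shapes are illustrative):

```python
import torch

def moe_permute_grouped_gemm_unpermute(x, topk_ids, topk_weights, expert_weights):
    # x: [num_tokens, hidden], topk_ids / topk_weights: [num_tokens, topk]
    # expert_weights: list of [hidden, out_dim] tensors, one per expert
    num_tokens, topk = topk_ids.shape
    flat_ids = topk_ids.reshape(-1)
    sort_order = torch.argsort(flat_ids)          # group token copies by expert id
    src_rows = sort_order // topk                 # which token each permuted row came from
    permuted = x[src_rows]                        # "expand and permute", no padding

    # grouped GEMM: one GEMM per expert over its contiguous slice of rows
    counts = torch.bincount(flat_ids, minlength=len(expert_weights))
    out = x.new_empty(permuted.shape[0], expert_weights[0].shape[1])
    start = 0
    for e, n in enumerate(counts.tolist()):
        if n:
            out[start:start + n] = permuted[start:start + n] @ expert_weights[e]
        start += n

    # "unpermute and reduce": scatter rows back to their tokens, weighted by the router
    w = topk_weights.reshape(-1)[sort_order].unsqueeze(1).to(out.dtype)
    result = x.new_zeros(num_tokens, out.shape[1])
    result.index_add_(0, src_rows, out * w)
    return result
```

The sorted layout makes each expert's rows contiguous, so the per-expert GEMMs need no padding; a real kernel would replace the Python loop with a single grouped GEMM.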

Member
@zhyncs commented Jan 29, 2025

> directly integrate TensorRT's fused MoE as a backend into sglang

sounds good @BBuf

Collaborator
@BBuf commented Jan 29, 2025

> directly integrate TensorRT's fused MoE as a backend into sglang
>
> sounds good @BBuf

Yeah, I can have a try.
