CTCLoss: Fix the hang issue caused by barrier divergence #1087
Conversation
```diff
@@ -111,7 +111,6 @@ struct CTCLossLogAlphaKernelFunctor {
         have_three = false;
       }
       for (int64_t t = 1; t < max_input_length_; t++) {
-        item.barrier(sycl_local_fence);
```
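For context: a SYCL group barrier must be reached by every work-item in the work-group. If some work-items return early while the rest keep hitting a barrier inside a loop, the group deadlocks, which is the hang this PR addresses. A minimal self-contained sketch of that hazard, with illustrative names and sizes rather than the actual CTCLoss kernel:

```cpp
#include <sycl/sycl.hpp>

// Illustrative only: reproduces the hazard pattern, not the real CTCLoss
// kernel. `data` is assumed to be a USM device allocation of at least n ints.
void divergent_barrier_demo(sycl::queue& q, int* data, int n) {
  q.parallel_for(
      sycl::nd_range<1>{sycl::range<1>{64}, sycl::range<1>{64}},
      [=](sycl::nd_item<1> item) {
        size_t i = item.get_global_id(0);
        // Early exit taken by only *some* work-items in the group...
        if (i >= static_cast<size_t>(n)) {
          return;
        }
        for (int t = 1; t < 8; t++) {
          // ...so when n < 64 the remaining work-items wait here forever:
          // the work-items that already returned never reach the barrier.
          sycl::group_barrier(item.get_group());
          data[i] += t;
        }
      });
  q.wait();  // hangs when the group's control flow diverged above
}
```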
Currently it depends on the workload to expose this kind of issue. My question is why we could not find the issue earlier.
To catch such issues you typically need to do one of two things, preferably both:
- Each change to the code (or almost each) should be accompanied by a dedicated test. If you reuse an existing test, we need to review that it covers all corner cases. With the issue we spotted, we could have checked that the early-exit conditions are actually being exercised (see the sketch after this list).
- Run real-life tests, preferably from a real-life 3rd-party framework or library. Hugging Face Transformers gives you an excellent way to do this.

The 1st item is the stronger guarantee that issues won't be missed. The 2nd item is a weaker guarantee, but it gives certainty that at least some real-life cases will work. In both cases issues might still be missed, but by using both we reduce the probability that something slips through.
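As a concrete illustration of the first point, here is a hedged sketch of what a dedicated corner-case test could look like in libtorch. The mixed zero-length and nonzero-length `target_lengths` are chosen so that some batch elements take the early-exit path while others keep iterating; an XPU-enabled build and `torch::kXPU` are assumptions, and `torch::ctc_loss` mirrors the Python `torch.nn.functional.ctc_loss` signature:

```cpp
#include <torch/torch.h>
#include <iostream>

// Hedged sketch of a dedicated corner-case test. The zero entries in
// target_lengths force the early-exit path in the CTC loss kernel for some
// batch elements while others keep iterating. Assumes an XPU-enabled
// libtorch build.
int main() {
  torch::Device device(torch::kXPU);
  const int64_t T = 50, N = 4, C = 20;
  auto log_probs = torch::randn({T, N, C}, device).log_softmax(/*dim=*/2);
  auto targets = torch::randint(
      1, C, {N, 10},
      torch::TensorOptions().dtype(torch::kLong).device(device));
  // Length tensors can stay on the CPU; the op reads them as plain lists.
  auto input_lengths = torch::full({N}, T, torch::kLong);
  // Mix zero-length and nonzero-length targets in one batch.
  auto target_lengths = torch::tensor({0L, 10L, 0L, 5L});
  auto loss = torch::ctc_loss(log_probs, targets, input_lengths,
                              target_lengths, /*blank=*/0,
                              /*reduction=*/at::Reduction::Mean,
                              /*zero_infinity=*/true);
  std::cout << loss.item<float>() << "\n";  // should return, not hang
  return 0;
}
```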
With this PR, the issue reported in pytorch/pytorch#140781 is gone. The HF tests for the hubert model pass for me.
I trust your decision that the barrier really is not needed. Other than that, the change works to fix the issue I noticed. Consider extending test coverage to cover the missed case.
We still need barriers. The reason for the hang is that some threads exit prematurely, preventing the counter from resetting to zero. We are now planning to use named barriers instead of a group barrier in ctc_loss_kernel.
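For context, the common alternative to named barriers is to keep the group's control flow uniform: no work-item returns before the loop ends; the work is predicated instead, so every work-item still reaches every barrier. A minimal sketch of that pattern (illustrative only; this is not the actual kernel and not the named-barrier approach the author plans to use):

```cpp
#include <sycl/sycl.hpp>

// Illustrative pattern: predicate the work instead of returning early so
// that every work-item in the group reaches every barrier.
void uniform_barrier_demo(sycl::queue& q, int* data, int n) {
  q.parallel_for(
      sycl::nd_range<1>{sycl::range<1>{64}, sycl::range<1>{64}},
      [=](sycl::nd_item<1> item) {
        size_t i = item.get_global_id(0);
        bool active = i < static_cast<size_t>(n);  // predicate, no early return
        for (int t = 1; t < 8; t++) {
          if (active) {
            data[i] += t;  // only active work-items do the work...
          }
          // ...but every work-item reaches the barrier, so no deadlock.
          sycl::group_barrier(item.get_group());
        }
      });
  q.wait();
}
```

The trade-off is that inactive work-items keep spinning through the loop, which wastes occupancy; named barriers avoid that by synchronizing only a subset of the group, at the possible performance cost noted below.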
I was told by SYCL folks that named barriers might have performance drawbacks on current hardware generations. Be careful to verify performance.
I verified the updated version. It works to address the reported issue.
Resolve pytorch/pytorch#140781