Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CTCLoss: Fix the hang issue caused by barrier divergence #1087

Merged
merged 5 commits into from
Nov 18, 2024

Conversation

xytintel
Copy link
Contributor

@xytintel xytintel commented Nov 15, 2024

@@ -111,7 +111,6 @@ struct CTCLossLogAlphaKernelFunctor {
have_three = false;
}
for (int64_t t = 1; t < max_input_length_; t++) {
item.barrier(sycl_local_fence);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, it depends on workloads to expose such kind of issues. My question is why we could not find issues earlier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To catch such issues you typically need to do one of the 2 things, better both:

  1. Each or almost each change to the code should be accompanied with dedicated test. If you reuse existing test, we need to review that it tests all corner cases. With the issue we spotted we could check that early exit conditions are actually being tried out.
  2. Run real life tests, preferably from real life 3rd party framework or library. Huggingface Transformers gives you excellent way to do this.

1st item is a better guarantee that issues won't be missed. 2nd item is a lesser guarantee, but gives certainty that at least some real life cases will work. In both cases issues might be missed, but utilizing both we reduce the probability that something will get missed.

@dvrogozh
Copy link
Contributor

With this PR issue reported in pytorch/pytorch#140781 is gone. The HF tests for hubert model pass as follows for me: 156 passed, 257 skipped, 3 warnings in 52.34s.

Copy link
Contributor

@dvrogozh dvrogozh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I trust you decision that barrier really is not needed. Other than that change works to fix the issue I noticed. Consider to extend test coverage to cover the missed case.

@xytintel xytintel changed the title Remove unnecessary barriers in ctc_loss_kernel Using named barriers instead of group barrier in ctc_loss_kernel Nov 18, 2024
@xytintel
Copy link
Contributor Author

I trust you decision that barrier really is not needed. Other than that change works to fix the issue I noticed. Consider to extend test coverage to cover the missed case.

We still need barriers. The reason for the hang is that some threads exit prematurely, preventing the counter from resetting to zero. We are now planning to use named barrier to solve this problem.

@xytintel xytintel changed the title Using named barriers instead of group barrier in ctc_loss_kernel Adopt named barriers instead of group barrier in ctc_loss_kernel Nov 18, 2024
@xytintel xytintel changed the title Adopt named barriers instead of group barrier in ctc_loss_kernel Adopt named barrier instead of group barrier in ctc_loss_kernel Nov 18, 2024
@xytintel xytintel changed the title Adopt named barrier instead of group barrier in ctc_loss_kernel CTCLoss: Fix the hang issue caused by barrier divergence Nov 18, 2024
@xytintel xytintel added this pull request to the merge queue Nov 18, 2024
Merged via the queue into main with commit f9c7682 Nov 18, 2024
3 checks passed
@xytintel xytintel deleted the xyt/fix_ctc_loss_hang branch November 18, 2024 12:47
@dvrogozh
Copy link
Contributor

We are now planning to use named barrier to solve this problem.

I was said by sycl folks that named barriers might have performance drawbacks on current generations. Be careful to verify performance.

@dvrogozh
Copy link
Contributor

I verified updated version. It works to address reported issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

xpu: CTCLossLogAlphaKernelFunctor hangs on HF hubert test_retain_grad_hidden_states_attentions
3 participants