CTCLoss: Fix the hang issue caused by barrier divergence #1087
Conversation
```diff
@@ -111,7 +111,6 @@ struct CTCLossLogAlphaKernelFunctor {
         have_three = false;
       }
       for (int64_t t = 1; t < max_input_length_; t++) {
-        item.barrier(sycl_local_fence);
```
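For context: a SYCL group barrier must be reached by every work-item in the work-group. If some work-items return early while the rest keep hitting a barrier inside a loop, the group deadlocks, which is the hang this PR addresses. A minimal self-contained sketch of that hazard, with illustrative names and sizes rather than the actual CTCLoss kernel:

```cpp
#include <sycl/sycl.hpp>

// Illustrative only: reproduces the hazard pattern, not the real CTCLoss
// kernel. `data` is assumed to be a USM device allocation of at least n ints.
void divergent_barrier_demo(sycl::queue& q, int* data, int n) {
  q.parallel_for(
      sycl::nd_range<1>{sycl::range<1>{64}, sycl::range<1>{64}},
      [=](sycl::nd_item<1> item) {
        size_t i = item.get_global_id(0);
        // Early exit taken by only *some* work-items in the group...
        if (i >= static_cast<size_t>(n)) {
          return;
        }
        for (int t = 1; t < 8; t++) {
          // ...so when n < 64 the remaining work-items wait here forever:
          // the work-items that already returned never reach the barrier.
          sycl::group_barrier(item.get_group());
          data[i] += t;
        }
      });
  q.wait();  // hangs when the group's control flow diverged above
}
```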
Currently it depends on the workload to expose this kind of issue. My question is why we could not find the issue earlier.
To catch such issues you typically need to do one of two things, preferably both:
- Each change to the code (or almost each) should be accompanied by a dedicated test. If you reuse an existing test, we need to review that it covers all corner cases. With the issue we spotted, we could have checked that the early-exit conditions are actually being exercised (see the sketch after this list).
- Run real-life tests, preferably from a real-life 3rd-party framework or library. Hugging Face Transformers gives you an excellent way to do this.

The 1st item is the stronger guarantee that issues won't be missed. The 2nd item is a weaker guarantee, but it gives certainty that at least some real-life cases will work. In both cases issues might still be missed, but by using both we reduce the probability that something slips through.
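As a concrete illustration of the first point, here is a hedged sketch of what a dedicated corner-case test could look like in libtorch. The mixed zero-length and nonzero-length `target_lengths` are chosen so that some batch elements take the early-exit path while others keep iterating; an XPU-enabled build and `torch::kXPU` are assumptions, and `torch::ctc_loss` mirrors the Python `torch.nn.functional.ctc_loss` signature:

```cpp
#include <torch/torch.h>
#include <iostream>

// Hedged sketch of a dedicated corner-case test. The zero entries in
// target_lengths force the early-exit path in the CTC loss kernel for some
// batch elements while others keep iterating. Assumes an XPU-enabled
// libtorch build.
int main() {
  torch::Device device(torch::kXPU);
  const int64_t T = 50, N = 4, C = 20;
  auto log_probs = torch::randn({T, N, C}, device).log_softmax(/*dim=*/2);
  auto targets = torch::randint(
      1, C, {N, 10},
      torch::TensorOptions().dtype(torch::kLong).device(device));
  // Length tensors can stay on the CPU; the op reads them as plain lists.
  auto input_lengths = torch::full({N}, T, torch::kLong);
  // Mix zero-length and nonzero-length targets in one batch.
  auto target_lengths = torch::tensor({0L, 10L, 0L, 5L});
  auto loss = torch::ctc_loss(log_probs, targets, input_lengths,
                              target_lengths, /*blank=*/0,
                              /*reduction=*/at::Reduction::Mean,
                              /*zero_infinity=*/true);
  std::cout << loss.item<float>() << "\n";  // should return, not hang
  return 0;
}
```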
With this PR, the issue reported in pytorch/pytorch#140781 is gone. The HF tests for the hubert model pass for me.
I trust your decision that the barrier really is not needed. Other than that, the change works to fix the issue I noticed. Consider extending test coverage to cover the missed case.
We still need barriers. The reason for the hang is that some threads exit prematurely, preventing the counter from resetting to zero. We are now planning to use named barriers instead of a group barrier in ctc_loss_kernel.
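For context, the common alternative to named barriers is to keep the group's control flow uniform: no work-item returns before the loop ends; the work is predicated instead, so every work-item still reaches every barrier. A minimal sketch of that pattern (illustrative only; this is not the actual kernel and not the named-barrier approach the author plans to use):

```cpp
#include <sycl/sycl.hpp>

// Illustrative pattern: predicate the work instead of returning early so
// that every work-item in the group reaches every barrier.
void uniform_barrier_demo(sycl::queue& q, int* data, int n) {
  q.parallel_for(
      sycl::nd_range<1>{sycl::range<1>{64}, sycl::range<1>{64}},
      [=](sycl::nd_item<1> item) {
        size_t i = item.get_global_id(0);
        bool active = i < static_cast<size_t>(n);  // predicate, no early return
        for (int t = 1; t < 8; t++) {
          if (active) {
            data[i] += t;  // only active work-items do the work...
          }
          // ...but every work-item reaches the barrier, so no deadlock.
          sycl::group_barrier(item.get_group());
        }
      });
  q.wait();
}
```

The trade-off is that inactive work-items keep spinning through the loop, which wastes occupancy; named barriers avoid that by synchronizing only a subset of the group, at the possible performance cost noted below.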
I was told by SYCL folks that named barriers might have performance drawbacks on current hardware generations. Be careful to verify performance.
I verified the updated version. It works to address the reported issue.
Resolve pytorch/pytorch#140781