Improve Comms Benchmark Timing #833
Conversation
@Quentin-Anthony, thanks for the PR. Out of curiosity, have you previously tried the DS timers, which leverage CUDA events to support multi-stream execution? The above is not a requirement for this PR.
I have not. My rationale for not including them is that I want these benchmarks to be runnable without DeepSpeed installed, for pure torch runs. Despite this, it appears
Yes, please, let's try to keep DeepSpeed optional ;) Ideally it should work out of the box with just PyTorch installed. But then it might be a good idea to split this into its own repo? I think that way a lot more people might use it - otherwise, having it buried inside this repo isn't the best location, IMHO.
@Quentin-Anthony, thanks for the clarification. Sounds great to me.
@Quentin-Anthony, I think this is a great idea. Currently this super useful utility is somewhat buried here. And it does not seem to be a true DS example, i.e., it is not demonstrating some DS functionality.
* Change to CUDA event-based timing
* Add event args to called funcs
* Add missing comma to args
As reported to me by @stas00, using the `time`-based calls incorrectly reports the duration of comm ops when moving to many nodes. This PR moves all benchmarks to CUDA events so that we can directly time on the stream. @awan-10 @tjruwase
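The timing approach described above can be sketched roughly as follows: record CUDA events around the op so the measurement happens on the stream itself, with a host-timer fallback so the harness still runs without a GPU (or without torch at all, in the spirit of keeping dependencies optional). The `measure` helper name is illustrative, not the benchmark's actual API.

```python
import time

# Optional torch import: keep the harness runnable with plain Python.
try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:
    torch = None
    _HAS_CUDA = False


def measure(fn, iters=10):
    """Return the average duration of fn() in milliseconds."""
    if _HAS_CUDA:
        # CUDA events record on the stream, so they time the kernels
        # themselves rather than just the host-side launch calls.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()  # wait for the stream before reading the events
        return start.elapsed_time(end) / iters
    # Host-timer fallback: fine for CPU work, but misleading for
    # asynchronous GPU collectives, which is the bug this PR fixes.
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1000.0 / iters
```

On a GPU run this might wrap a collective, e.g. `measure(lambda: torch.distributed.all_reduce(x))`, once the process group is initialized.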