Does the new profiler api of profiler plugin save the dump file when app is completed? #1624
Comments
Yes, the example profiler plugin saves the collected traces when the communicator is finalized. You can extend the example to dump traces to a file at regular intervals as they are generated, instead of doing it at communicator finalize.
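As a rough illustration of the "dump at regular intervals" idea, here is a minimal, language-neutral sketch in Python. It is not the NCCL profiler plugin API (the example plugin is written in C); `PeriodicTraceWriter`, its parameters, and the JSON-lines output format are all hypothetical names chosen for this sketch. The pattern is simply: buffer events as they arrive, append them to a file once a time interval has elapsed or the buffer is full, and flush whatever remains at shutdown.

```python
import json
import time


class PeriodicTraceWriter:
    """Hypothetical helper: buffers trace events and appends them to a file periodically."""

    def __init__(self, path, flush_interval_s=60.0, max_buffered=1024,
                 clock=time.monotonic):
        self.path = path
        self.flush_interval_s = flush_interval_s
        self.max_buffered = max_buffered
        self.clock = clock  # injectable so the timing logic can be tested
        self.buffer = []
        self.last_flush = clock()

    def record(self, event):
        """Store one event (a JSON-serializable dict) and flush if it is time to."""
        self.buffer.append(event)
        interval_elapsed = self.clock() - self.last_flush >= self.flush_interval_s
        if interval_elapsed or len(self.buffer) >= self.max_buffered:
            self.flush()

    def flush(self):
        """Append buffered events to the file, one JSON object per line."""
        if self.buffer:
            with open(self.path, "a", encoding="utf-8") as f:
                for event in self.buffer:
                    f.write(json.dumps(event) + "\n")
            self.buffer.clear()
        self.last_flush = self.clock()

    def close(self):
        """Flush whatever is left; the analogue of dumping at communicator finalize."""
        self.flush()
```

In a real plugin the same logic would live in the C event-stop path, with `close()` corresponding to the existing dump at finalize. Appending rather than rewriting the file keeps each flush cheap and bounds memory use over a long training run.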
Understood, thank you very much for your prompt response: this is extremely helpful to me!
When using the example profiler plugin to record traces, I noticed that it always saves exactly 65 lines of content, even after increasing the training time or iteration count. Since I'm not yet familiar with how the profiler plugin works and am learning through the example, this behavior is puzzling to me.
The example plugin only stores a limited number of traces to keep memory usage low. You can increase the event pool sizes through environment variables. Please find more info in the documentation: https://github.com/NVIDIA/nccl/tree/master/ext-profiler/example#changing-the-profiler-memory-pool-sizes
When I tried the profiler plugin with NCCL's example, I found that the dump file is saved when my app completes. Can I profile while my app is running? For example, I use PyTorch for distributed training, and a run may take a month. How can I record the op times with the profiler plugin?