Does the new profiler api of profiler plugin save the dump file when app is completed? #1624

Open
Xiaoaier-Z-L opened this issue Mar 2, 2025 · 4 comments

Comments

@Xiaoaier-Z-L

When I tried the profiler plugin with NCCL's example, I found that the dump file is saved only when my app completes. Can I profile the app while it is still running? For example, I use PyTorch for distributed training, and a run may take a month. How can I record op times with the profiler plugin during such a run?

@Xiaoaier-Z-L Xiaoaier-Z-L changed the title Does the new profile api of profile plugin save the dump file when app is completed? Does the new profiler api of profiler plugin save the dump file when app is completed? Mar 2, 2025
@gcongiu
Contributor

gcongiu commented Mar 2, 2025

Yes, the example profiler plugin saves the collected traces when the communicator is finalized. You can extend the example to dump traces to a file at regular intervals as they are generated, instead of doing it only at communicator finalization.
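
One way to picture the periodic dump suggested above is a small buffer that writes itself out when it fills up or when a time interval has elapsed. The following is a minimal sketch under stated assumptions, not code from the NCCL example plugin: `TraceEvent`, `TraceBuffer`, `TRACE_BUF_CAP`, and the `trace_buffer_*` functions are all hypothetical names, and the one-JSON-object-per-line output is an illustrative choice rather than the example's actual trace format.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

#define TRACE_BUF_CAP 256  /* hypothetical capacity, chosen for illustration */

/* Hypothetical trace record; the real plugin defines its own event types. */
typedef struct {
  const char *name;  /* operation label, assumed to need no JSON escaping */
  double start_us;   /* start timestamp in microseconds */
  double dur_us;     /* duration in microseconds */
} TraceEvent;

typedef struct {
  TraceEvent events[TRACE_BUF_CAP];
  int count;          /* number of buffered events */
  time_t last_flush;  /* wall-clock time of the last flush */
  double interval_s;  /* flush at least this often, in seconds */
  FILE *out;          /* destination stream, owned by the caller */
} TraceBuffer;

void trace_buffer_init(TraceBuffer *b, FILE *out, double interval_s) {
  assert(b != NULL && out != NULL);
  b->count = 0;
  b->last_flush = time(NULL);
  b->interval_s = interval_s;
  b->out = out;
}

/* Write every buffered event as one JSON object per line, then reset.
 * Returns the number of events written. */
int trace_buffer_flush(TraceBuffer *b) {
  int n = b->count;
  for (int i = 0; i < n; i++) {
    fprintf(b->out, "{\"name\": \"%s\", \"ts\": %.3f, \"dur\": %.3f}\n",
            b->events[i].name, b->events[i].start_us, b->events[i].dur_us);
  }
  fflush(b->out);  /* push data to the OS so the file is readable mid-run */
  b->count = 0;
  b->last_flush = time(NULL);
  return n;
}

/* Buffer one event; flush if the buffer is full or the interval has elapsed.
 * Returns the number of events flushed by this call (0 if none). */
int trace_buffer_record(TraceBuffer *b, TraceEvent ev) {
  assert(b->count < TRACE_BUF_CAP);
  b->events[b->count++] = ev;
  int full = (b->count == TRACE_BUF_CAP);
  int due = difftime(time(NULL), b->last_flush) >= b->interval_s;
  return (full || due) ? trace_buffer_flush(b) : 0;
}
```

Writing one record per line keeps a partially written file parseable while the job is still running. The sketch is not thread-safe; a plugin whose callbacks can run on several threads would likely need a lock around `trace_buffer_record` and `trace_buffer_flush`.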

@Xiaoaier-Z-L
Author

Understood, thank you very much for your prompt response: this is extremely helpful to me!

@Xiaoaier-Z-L
Author

Xiaoaier-Z-L commented Mar 3, 2025

When using the example profiler plugin to record traces, I noticed that it always saves exactly 65 lines of content, even after I increase the training time or iteration count. I'm not yet familiar with how the profiler plugin works internally and am learning through the example, so this behavior puzzles me.
Is the trace output truncated by default?

@gcongiu
Copy link
Contributor

gcongiu commented Mar 3, 2025

The example plugin only stores a limited number of traces to keep memory usage low. You can increase the event pool sizes through environment variables. Please find more info in the documentation: https://github.com/NVIDIA/nccl/tree/master/ext-profiler/example#changing-the-profiler-memory-pool-sizes
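
To illustrate why a fixed-size event pool yields the same number of output lines no matter how long the job runs, here is a minimal sketch, assuming a pool whose capacity is read once from an environment variable and which refuses new events when full. `parse_pool_size`, `pool_size_from_env`, `EventPool`, and `pool_try_alloc` are hypothetical names; the actual variable names are listed in the linked README, and this thread does not say whether the example drops new events or overwrites old ones when a pool fills up.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a positive decimal integer; return fallback on any invalid input. */
int parse_pool_size(const char *s, int fallback) {
  if (s == NULL || *s == '\0') return fallback;
  errno = 0;
  char *end = NULL;
  long v = strtol(s, &end, 10);
  if (errno != 0 || *end != '\0' || v <= 0 || v > INT_MAX) return fallback;
  return (int)v;
}

/* Read a pool size from the named environment variable (name supplied by
 * the caller), falling back to a default when it is unset or invalid. */
int pool_size_from_env(const char *var, int fallback) {
  assert(var != NULL);
  return parse_pool_size(getenv(var), fallback);
}

/* Hypothetical fixed-capacity pool: once full, further events are counted
 * as dropped, so the number of recorded events never exceeds capacity. */
typedef struct {
  int capacity;
  int used;
  int dropped;
} EventPool;

void pool_init(EventPool *p, int capacity) {
  assert(p != NULL && capacity > 0);
  p->capacity = capacity;
  p->used = 0;
  p->dropped = 0;
}

/* Returns 1 if a slot was reserved, 0 if the pool was already full. */
int pool_try_alloc(EventPool *p) {
  if (p->used >= p->capacity) {
    p->dropped++;
    return 0;
  }
  p->used++;
  return 1;
}
```

Under this model, raising the capacity through the environment raises the ceiling on recorded events, but a month-long run would still hit it eventually, which is why periodic dumping is the complementary change.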
