Running in Docker fails with "Bus error: nonexistent physical address" #76

Open
guanyonglai opened this issue Jan 29, 2023 · 1 comment

Comments

@guanyonglai

[8d6bf97c3bf4:3039 :0:3095] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 3095) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bb41 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000007b0ef ncclGroupEnd() ???:0
4 0x0000000000059e97 ncclGetUniqueId() ???:0
5 0x00000000000489b1 ???() /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
6 0x000000000004a655 ???() /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
7 0x0000000000063dcc ncclRedOpDestroy() ???:0
8 0x0000000000008609 start_thread() ???:0
9 0x000000000011f133 clone() ???:0
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

@guanyonglai
Author

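This happens because `docker run` allocates too little shared memory by default: only 64 MB. Adding `--shm-size="6g"` to the `docker run` command allocates more shared memory to the container.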
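To confirm the diagnosis from inside the container before launching training, a small check of `/dev/shm` can help. The following is a minimal sketch, not part of the project: the helper names `shm_total_bytes` and `shm_is_sufficient` are hypothetical, and the 6 GiB threshold simply mirrors the `--shm-size="6g"` value suggested above, not a documented NCCL requirement.

```python
import shutil


def shm_total_bytes(path: str = "/dev/shm") -> int:
    """Return the total size, in bytes, of the filesystem mounted at `path`."""
    return shutil.disk_usage(path).total


def shm_is_sufficient(total_bytes: int, required_gib: float = 6.0) -> bool:
    """Check a shared-memory size against a threshold given in GiB.

    The 6 GiB default is an assumption taken from the workaround in this
    issue; the amount actually needed depends on the workload.
    """
    return total_bytes >= required_gib * 1024**3


if __name__ == "__main__":
    total = shm_total_bytes()
    print(f"/dev/shm size: {total / 1024**2:.0f} MiB")
    if not shm_is_sufficient(total):
        print('Shared memory looks small; consider docker run --shm-size="6g"')
```

In a container started without `--shm-size`, the reported size should be about 64 MiB, matching the default described above.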
