server crashed at the beginning #37

Closed
jasperzhong opened this issue Jul 18, 2020 · 4 comments


@jasperzhong
Owner

@vycezhong I ran this PR with our MXNet VGG-16 test to check for regressions. I used 2 worker nodes with 8 GPUs each, plus 2 server nodes. One of the server nodes core dumps, and it happens consistently. Is this something you've seen before? I didn't change the test to use gradient compression, so dmlc/ps-lite#168 shouldn't matter here.

[00:06:35] byteps/server/server.cc:430: BytePS server engine uses 16 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[00:06:35] byteps/server/server.cc:438: Enable engine scheduling for BytePS server
[00:06:35] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[00:06:35] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[00:06:35] src/van.cc:421: Bind to role=server, ip=xxxxxxx, port=48413, is_recovery=0
[00:06:35] src/./zmq_van.h:287: Start ZMQ recv thread
[00:06:35] src/van.cc:510: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=xxxxxxxx, port=48413, is_recovery=0 } }. THIS IS NOT DATA MSG!
[00:07:34] src/van.cc:535: 1 => 2147483647. Meta: request=0, timestamp=3, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=xxx.196, port=35657, is_recovery=0 role=server, id=8, ip=xxx.195, port=61601, is_recovery=0 role=server, id=10, ip=xxx.144, port=48413, is_recovery=0 role=worker, id=11, ip=xxx.142, port=29591, is_recovery=0 role=scheduler, id=1, ip=xxx.195, port=9000, is_recovery=0 } }. THIS IS NOT DATA MSG!
[00:07:34] src/van.cc:370: S[10] is connected to others
[00:07:35] src/van.cc:510: ? => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:07:35] src/van.cc:535: 1 => 10. Meta: request=0, timestamp=8, control={ cmd=BARRIER, barrier_group=-564201712 }. THIS IS NOT DATA MSG!
[00:07:35] src/van.cc:510: ? => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:07:35] src/van.cc:535: 11 => 10. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140723023324848, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=256 data_size=4
[00:07:35] src/van.cc:535: 9 => 10. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140724865464560, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=256 data_size=4
Segmentation fault      (core dumped) bpslaunch

Originally posted by @pleasantrabbit in bytedance#225 (comment)

@jasperzhong
Owner Author

I cannot reproduce this bug on my local dev machine. I will try it later in a distributed setting. My guess is that it is caused by using the latest ps-lite. @pleasantrabbit

@jasperzhong
Owner Author

I wonder if you have solved this problem. @pleasantrabbit

@pleasantrabbit

@vycezhong yes, this change seems to solve it, but I still don't understand the reason: bytedance@0d8a38a


@jasperzhong
Owner Author

That's weird. I think the two pieces of code should be equivalent. I use std::memcpy for better performance because OpenMP's overhead takes up most of the time for small tensors.
