server crashed at the beginning #37
Comments
I cannot reproduce this bug on my local dev machine. I will try it later in a distributed setting. My guess is that it is caused by using the latest ps-lite. @pleasantrabbit
I wonder if you have solved this problem. @pleasantrabbit
@vycezhong yes, this change seems to solve it, but I still don't understand the reason: bytedance@0d8a38a
That's weird. I think the two pieces of code should be equivalent. I use
@vycezhong I ran this PR with our mxnet vgg-16 test to check for regressions. I used 2 worker nodes, each with 8 GPUs, and 2 server nodes. One of the server nodes core dumps, and it happens consistently. Is this something you've seen before? I didn't change the test to use gradient compression, so dmlc/ps-lite#168 shouldn't matter here.
Originally posted by @pleasantrabbit in bytedance#225 (comment)