How to debug distributed mode? #161
Comments
Does it run normally in non-distributed mode? Did you start the scheduler and server? We need your complete starting commands so that we can really help you.
It runs normally in non-distributed mode. I did not start the scheduler and server. Can you give me a guide to running distributed mode with only one worker like this?
@bobzhuyb If I set BYTEPS_FORCE_DISTRIBUTED as follows, should I also start the scheduler and server?
If you set BYTEPS_FORCE_DISTRIBUTED, you must start at least one server and one scheduler, just as you would when starting a distributed training job. You can start the server and scheduler on the same machine as the worker, and set DMLC_PS_ROOT_URI to your local IP or even 127.0.0.1.
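The single-machine setup described above could be sketched as follows. This is a minimal sketch, not the project's official launch procedure: only BYTEPS_FORCE_DISTRIBUTED and DMLC_PS_ROOT_URI are named in this thread, while DMLC_ROLE, DMLC_NUM_WORKER, DMLC_NUM_SERVER, DMLC_PS_ROOT_PORT, and DMLC_WORKER_ID are assumed from the ps-lite conventions that BytePS follows. The port 1234 and the helper name `build_role_envs` are hypothetical. The code only builds the per-role environments; how each process is actually launched depends on your BytePS version.

```python
import os


def build_role_envs(root_uri="127.0.0.1", root_port=1234):
    """Build environment dicts for one scheduler, one server, and one worker.

    Variable names other than BYTEPS_FORCE_DISTRIBUTED and DMLC_PS_ROOT_URI
    are assumptions based on ps-lite conventions; the port is a placeholder.
    """
    # Settings that all three roles must agree on.
    shared = {
        "DMLC_NUM_WORKER": "1",
        "DMLC_NUM_SERVER": "1",
        "DMLC_PS_ROOT_URI": root_uri,
        "DMLC_PS_ROOT_PORT": str(root_port),
    }
    envs = {}
    for role in ("scheduler", "server", "worker"):
        env = dict(os.environ)  # start from the current environment
        env.update(shared)
        env["DMLC_ROLE"] = role
        envs[role] = env
    # Worker-only settings: its ID and the flag discussed in this thread.
    envs["worker"]["DMLC_WORKER_ID"] = "0"
    envs["worker"]["BYTEPS_FORCE_DISTRIBUTED"] = "1"
    return envs


if __name__ == "__main__":
    for role, env in build_role_envs().items():
        settings = {k: v for k, v in env.items()
                    if k.startswith(("DMLC_", "BYTEPS_"))}
        print(role, settings)
```

Each dictionary can be passed as the `env=` argument of `subprocess.Popen` when starting the corresponding process, so the scheduler, server, and worker all point at the same root address.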
@bobzhuyb
Why did you overwrite BYTEPS_SERVER_MXNET_PATH? I don't think you should overwrite it. Anyway, we'll get rid of the MXNet part of BytePS soon, so you won't have this kind of problem anymore. See the
@bobzhuyb If I do not overwrite BYTEPS_SERVER_MXNET_PATH, there will be a problem:
This is very strange. The environment variable should be there. See If you don't have it, set it as the Dockerfile does.
Is
Can you set
Let's clarify a few things.
I feel that
You are right! When I change the IP to what
What do you mean by From your screenshot, it does not look like an IP binding/connection problem.
I agree with you that it is not the port problem. So why did the worker crash?
Are you using the Docker image provided by us or one you built from source yourself? Can you confirm that the same build works okay in non-distributed mode? There is a knob here https://github.com/bytedance/byteps/blob/master/launcher/launch.py#L37 that allows you to run the program with gdb. Hopefully it will give us more info.
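For reference, running a Python training script under gdb generally means wrapping the interpreter invocation, roughly as sketched below. This illustrates the general technique and is not the launcher's actual code: the function name `wrap_with_gdb`, the toggle variable `ENABLE_GDB`, and the script name `train.py` are hypothetical, so check the linked launch.py line for the real knob.

```python
import os
import shlex
import sys


def wrap_with_gdb(argv, enabled=None):
    """Return argv unchanged, or prefixed so that gdb runs it in batch mode.

    ENABLE_GDB is a hypothetical toggle used only for this sketch.
    """
    if enabled is None:
        enabled = os.environ.get("ENABLE_GDB", "0") == "1"
    if not enabled:
        return list(argv)
    # -batch: exit when done; -ex run: start the program;
    # -ex bt: print a backtrace if it stops; --args: the rest is the program.
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args", *argv]


if __name__ == "__main__":
    cmd = wrap_with_gdb([sys.executable, "train.py"], enabled=True)
    print(" ".join(shlex.quote(part) for part in cmd))
```

Note that the program handed to gdb after `--args` has to be a binary executable such as the Python interpreter. gdb cannot load a shell script as its target.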
The output you have is just a warning that you can probably ignore. The problem is that the BytePS launcher uses gdb to start the shell script
I have only one worker and want to debug distributed mode. I exported BYTEPS_FORCE_DISTRIBUTED=1 like this, but it failed. What else should I set?