
How to debug distributed mod? #161

Closed
hbsun2113 opened this issue Nov 24, 2019 · 24 comments

@hbsun2113

I have only one worker and want to debug distributed mode. I exported BYTEPS_FORCE_DISTRIBUTED=1 as shown below, but it failed. What else should I set?
image
image

@hbsun2113
Author

nohup.txt

@hbsun2113
Author

image

@bobzhuyb
Member

Does it run normally in non-distributed mode? Did you start the scheduler and server? We need your complete startup commands so that we can help you.

@hbsun2113
Author

> Does it run normally in non-distributed mode? Did you start the scheduler and server? We need your complete startup commands so that we can help you.

It runs normally in non-distributed mode. I did not start the scheduler or server. Could you point me to a guide for running distributed mode with only one worker, like this?

@hbsun2113
Author

@bobzhuyb If I set BYTEPS_FORCE_DISTRIBUTED as follows, should I also start the scheduler and server?
image

@bobzhuyb
Member

> @bobzhuyb If I set BYTEPS_FORCE_DISTRIBUTED as follows, should I also start the scheduler and server?
> image

If you set BYTEPS_FORCE_DISTRIBUTED, you must start at least one server and one scheduler, just as you would for a distributed training job. You can start the server and scheduler on the same machine as the worker, and set DMLC_PS_ROOT_URI to your local IP or even 127.0.0.1.
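
For illustration, the single-machine setup described here could be sketched as one shell helper that is sourced in three separate shells, one per role. The variable names are the ones used in this thread; the function name `byteps_local_env` and the port 1234 are assumptions for the example, and a real job may need further settings. The command that actually launches each role is not shown in this thread and is deliberately left out.

```shell
# Hypothetical helper: export the ps-lite settings for one role
# (scheduler, server, or worker) of a one-worker, one-server job
# whose scheduler listens on the loopback address.
byteps_local_env() {
    role="$1"
    case "$role" in
        scheduler|server|worker) ;;
        *) echo "usage: byteps_local_env scheduler|server|worker" >&2
           return 1 ;;
    esac
    export DMLC_ROLE="$role"
    export DMLC_NUM_WORKER=1
    export DMLC_NUM_SERVER=1
    export DMLC_PS_ROOT_URI=127.0.0.1   # scheduler address; must be bindable locally
    export DMLC_PS_ROOT_PORT=1234       # assumed free port; pick any unused one
    if [ "$role" = worker ]; then
        # Set on the worker, as in the original question.
        export BYTEPS_FORCE_DISTRIBUTED=1
    fi
}
```

Running `byteps_local_env scheduler`, `byteps_local_env server`, and `byteps_local_env worker` in three different shells gives all three processes the same scheduler address and port, which is what lets them find each other.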

@hbsun2113
Author

@bobzhuyb
Hi, I followed these steps to start the scheduler:

docker pull bytepsimage/byteps_server
docker run -it --net=host bytepsimage/byteps_server bash
export DMLC_NUM_WORKER=1
export DMLC_ROLE=scheduler 
export DMLC_NUM_SERVER=1 
export DMLC_PS_ROOT_URI=127.0.0.1
export DMLC_PS_ROOT_PORT=3034  

but I ran into the following problem. Could you help me?
image

@hbsun2113
Author

hbsun2113 commented Nov 25, 2019

@bobzhuyb
image

@bobzhuyb
Member

Why did you overwrite the BYTEPS_SERVER_MXNET_PATH? I don't think you should overwrite it.

Anyway, we'll remove the MXNet part of BytePS soon, so you won't have this kind of problem anymore. See the server branch. We will merge it in a week or two and update the Docker image.

@hbsun2113
Author

hbsun2113 commented Nov 25, 2019

@bobzhuyb If I do not overwrite BYTEPS_SERVER_MXNET_PATH, I get this problem:
image
Is there a way to start the scheduler now?

@bobzhuyb
Member

bobzhuyb commented Nov 25, 2019

> @bobzhuyb If I do not overwrite BYTEPS_SERVER_MXNET_PATH, I get this problem:
> image
> Is there a way to start the scheduler now?

This is very strange. The environment variable should already be there. See
https://github.com/bytedance/byteps/blob/master/docker/Dockerfile.server#L29
or
https://github.com/bytedance/byteps/blob/master/docker/Dockerfile.mix.mxnet15#L119

If you don't have it, set it the same way the Dockerfile does.
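
Before launching, it can help to confirm that the expected variables are actually set inside the container. A minimal sketch follows; `require_env` is a hypothetical helper, not part of BytePS, and the value for BYTEPS_SERVER_MXNET_PATH should be taken from the linked Dockerfile rather than from this example.

```shell
# Hypothetical pre-flight check: report which of the named environment
# variables are unset or empty, and return non-zero if any are missing.
require_env() {
    missing=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the scheduler container in this thread, a call might look like `require_env DMLC_ROLE DMLC_NUM_WORKER DMLC_NUM_SERVER DMLC_PS_ROOT_URI DMLC_PS_ROOT_PORT BYTEPS_SERVER_MXNET_PATH`; any name it prints still needs to be exported.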

@hbsun2113
Author

hbsun2113 commented Nov 25, 2019

@bobzhuyb
Now I can start the scheduler successfully with export DMLC_PS_ROOT_URI=127.0.0.1:
image
But if I change it to export DMLC_PS_ROOT_URI=xx.xx.xx.xx, where xx.xx.xx.xx is a public IP, it fails as follows:
image
I think this is a bug in ps-lite. Have you seen it?

@bobzhuyb
Member

Is xx.xx.xx.xx an IP that belongs to the machine and is visible to the Docker container? Is the DMLC_PS_ROOT_PORT port available? What value did you set for DMLC_PS_ROOT_PORT?

@hbsun2113
Author

hbsun2113 commented Nov 25, 2019

> Is xx.xx.xx.xx an IP that belongs to the machine and is visible to the Docker container? Is the DMLC_PS_ROOT_PORT port available? What value did you set for DMLC_PS_ROOT_PORT?

xx.xx.xx.xx is a public IP that belongs to the machine and is visible to the Docker container.
DMLC_PS_ROOT_PORT is 3034, and I think it is available. Here is how I tested the port:
I ran python3 -m http.server in the container and could reach xx.xx.xx.xx:3034 from Chrome:
image
What the container shows:
image

My env is:
image

@bobzhuyb
Member

Can you set PS_VERBOSE=2 when you run the scheduler? This will output more logs.
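
As a sketch of that suggestion, the variable can be set for a single command instead of being exported for the whole shell. The wrapper name `run_verbose` is an assumption for illustration; the launch command itself is whatever you already use to start the scheduler.

```shell
# Run any command with PS_VERBOSE=2 set for that command only,
# merging stderr into stdout so the extra logs can be captured together.
run_verbose() {
    env PS_VERBOSE=2 "$@" 2>&1
}
```

Piping the result through `tee`, for example `run_verbose <your launch command> | tee scheduler.log`, keeps a copy of the verbose output that can be attached to the issue as text rather than as a screenshot.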

@hbsun2113
Author

hbsun2113 commented Nov 25, 2019

> Can you set PS_VERBOSE=2 when you run the scheduler? This will output more logs.

image

image

@bobzhuyb
Member

bobzhuyb commented Nov 25, 2019

Let's clarify a few things.

  1. Can you bind your HTTP server to the 162.x.x.x IP and 3034 port?

  2. When you say 162.x.x.x "belongs to the machine", do you mean the public IP is assigned by OpenStack/K8S/other cloud, or do you mean that it actually shows up in ifconfig?

I feel that 162.x.x.x is not an IP that can be bound locally.
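
One way to check the second question is to list the addresses configured on the machine's interfaces and see whether the candidate IP is among them. The sketch below assumes either `ip` or `ifconfig` is installed; the function names are hypothetical. Note also that `python3 -m http.server` listens on all interfaces by default, so reaching it through a public IP does not show that the IP can be bound locally; passing `--bind <address>` makes that test stricter.

```shell
# Print the IPv4 addresses configured on local interfaces, one per line.
local_ipv4_addrs() {
    if command -v ip >/dev/null 2>&1; then
        ip -4 -o addr show | awk '{print $4}' | cut -d/ -f1
    else
        # Older ifconfig prints "inet addr:1.2.3.4"; newer prints "inet 1.2.3.4".
        ifconfig | awk '/inet /{print $2}' | sed 's/^addr://'
    fi
}

# Succeed if address $1 is an exact line in the newline-separated list $2.
is_in_list() {
    printf '%s\n' "$2" | grep -Fxq -- "$1"
}
```

A check such as `is_in_list "$DMLC_PS_ROOT_URI" "$(local_ipv4_addrs)" || echo "not a local address"` would flag a cloud-assigned public IP that is mapped to the machine by NAT but never appears on an interface.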

@hbsun2113
Author

hbsun2113 commented Nov 25, 2019

> Let's clarify a few things.
>
>   1. Can you bind your HTTP server to the 162.x.x.x IP and 3034 port?
>   2. When you say 162.x.x.x "belongs to the machine", do you mean the public IP is assigned by OpenStack/K8S/other cloud, or do you mean that it actually shows up in ifconfig?
>
> I feel that 162.x.x.x is not an IP that can be bound locally.

You are right! When I change the IP to the one ifconfig shows, the scheduler starts successfully.
Can I change the port used by the server or worker? My machine has only a few available ports, and my worker failed to run.

worker:
image
image

scheduler:
image

server:
image

@bobzhuyb
Member

What do you mean by available ports? Are you referring to the ports allowed by the cloud's security group? The 172.x.x.x IP is an internal IP within a subnet, which should not be affected by security groups. For now, I don't think you can specify the worker port. It will bind to a random port. This is ps-lite's logic.

From your screenshot, it does not look like an IP binding or connection problem.

@hbsun2113
Author

> What do you mean by available ports? Are you referring to the ports allowed by the cloud's security group? The 172.x.x.x IP is an internal IP within a subnet, which should not be affected by security groups. For now, I don't think you can specify the worker port. It will bind to a random port. This is ps-lite's logic.
>
> From your screenshot, it does not look like an IP binding or connection problem.

I agree that it is not a port problem. So why did the worker crash?

@hbsun2113
Author

@bobzhuyb

image

@bobzhuyb
Member

Are you using the Docker image we provide, or did you build from source yourself? Can you confirm that the same build works in non-distributed mode?

There is a knob here https://github.com/bytedance/byteps/blob/master/launcher/launch.py#L37 that allows you to run the program with gdb. Hopefully it will give us more info.

@hbsun2113
Author

> Are you using the Docker image we provide, or did you build from source yourself? Can you confirm that the same build works in non-distributed mode?
>
> There is a knob here https://github.com/bytedance/byteps/blob/master/launcher/launch.py#L37 that allows you to run the program with gdb. Hopefully it will give us more info.

  1. I built it from source myself.

  2. The same build works in non-distributed mode:
    image
    and the env is:
    image
    I also noticed that:
    image

  3. But if I export BYTEPS_FORCE_DISTRIBUTED=1:
    image

  4. GDB: Sorry, I do not have permission to restart the Docker container with gdb:
    image

@bobzhuyb
Member

bobzhuyb commented Nov 26, 2019

The output you have is just a warning that you can probably ignore. The problem is that the BytePS launcher uses gdb to start the shell script example/pytorch/start_pytorch_byteps.sh. Can you avoid using example/pytorch/start_pytorch_byteps.sh and just use the launcher to start a Python script in hbsun_run.sh?
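
The underlying constraint is that gdb needs an executable binary, such as the Python interpreter, as its target; a shell script is not one. The sketch below shows the general shape of running a Python script under gdb with standard gdb options. It is not the launcher's own code: `run_under_gdb`, the `GDB` override, and `train.py` are placeholders for illustration.

```shell
# Hypothetical wrapper: run a command under gdb in batch mode and print a
# backtrace if it crashes. Refuses shell scripts, which gdb cannot load.
run_under_gdb() {
    case "$1" in
        *.sh)
            echo "gdb needs a binary (e.g. python3), not a shell script: $1" >&2
            return 1 ;;
    esac
    # GDB can be overridden to point at a different gdb binary.
    "${GDB:-gdb}" -ex run -ex bt -batch --args "$@"
}
```

For example, `run_under_gdb python3 train.py` starts the interpreter under gdb, whereas `run_under_gdb start_pytorch_byteps.sh` is rejected up front.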
