Multi node training #36
I wonder if there is something wrong with my setup. I can hardly ever get multi-machine PipeDream to train successfully; it reports an error every time, as in the first few issues I raised. Can anyone help me?
Not sure what your method is. We were able to successfully run multi-machine PipeDream experiments. Some instructions are here: https://github.com/msr-fiddle/pipedream/blob/master/EXPERIMENTS.md. In particular, look for the *_16gpu.yml files, which are all run using multiple servers. Alternatively, please send me the commands you're trying to run, and I can try helping you out.
Thank you for your reply! server2:
> Note: the models used below are your generated models; I just use them for training.

When I use --config_path models/vgg16/gpus=16_straight/hybrid_conf.json, it throws the following error:

When I use --config_path models/vgg16/gpus=16/mp_conf.json, it throws the following error:
rank_to_stage_map {0: 0, 1: 1}

But when I use --config_path models/vgg16/gpus=16/hybrid_conf.json, it trains successfully!

So I am very confused why there is such a difference in results; my environment has not changed in any way, I only changed the configuration file.

My environment: server 1: 8 V100
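For what it's worth, one quick thing to check when switching between generated configs is whether the number of ranks listed in the JSON matches the number of worker processes you actually launch across the servers. Here is a minimal sketch, assuming the generated config exposes a stage_to_rank_map entry (treat the key name as an assumption; adjust it if your generated JSON differs):

```python
import json

# Assumes the generated config contains a "stage_to_rank_map" entry;
# adjust the key name if your generated JSON differs.
with open("models/vgg16/gpus=16/hybrid_conf.json") as f:
    conf = json.load(f)

stage_to_rank_map = {int(stage): ranks
                     for stage, ranks in conf["stage_to_rank_map"].items()}

# Invert it to get the rank -> stage mapping similar to the
# "rank_to_stage_map {0: 0, 1: 1}" line printed in the log above.
rank_to_stage_map = {rank: stage
                     for stage, ranks in stage_to_rank_map.items()
                     for rank in ranks}
print("rank_to_stage_map", rank_to_stage_map)

# The total number of ranks here should equal the number of worker
# processes launched across the two servers (16 for a gpus=16 config).
print("num ranks in config:", len(rank_to_stage_map))
```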
I think that, in theory, using your generated models directly should at least train successfully, but something goes wrong.
If I understand correctly, there are some problems with some of the configurations mentioned above. It works well after I modify the partition so that each stage has at least one layer with trainable parameters. I think the reason for this problem is that the current program cannot handle stages that do not contain any trainable parameters.
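Here is a minimal sketch of that check, using placeholder stage modules (the stages list below is a toy example, not the actual VGG-16 partition): a stage whose module has no trainable parameters is the problematic case described above.

```python
import torch.nn as nn

def stages_without_trainable_params(stage_modules):
    """Return indices of pipeline stages that contain no trainable parameters."""
    empty = []
    for idx, module in enumerate(stage_modules):
        if not any(p.requires_grad for p in module.parameters()):
            empty.append(idx)
    return empty

# Toy illustration: stage 1 holds only a parameter-free pooling layer,
# which is the situation described above.
stages = [
    nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()),
    nn.Sequential(nn.MaxPool2d(2)),   # no trainable parameters
    nn.Sequential(nn.Linear(64, 10)),
]
print(stages_without_trainable_params(stages))  # -> [1]
```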
I hope examples of multi-node training can be added.