Skip to content
This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[WIP] distributed training #1334

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

ZiyueHuang
Copy link
Member

@ZiyueHuang ZiyueHuang commented Sep 1, 2020

Description

Based on this branch (https://github.com/ZiyueHuang/byteps/tree/mx2), we can perform distributed training for the electra model (and other models). Tested both on a single worker and two workers, each has multiple GPU cards. However, there are two issues:

  • We should first call trainer._init_params() before the first iteration (forward/backward) to synchronize the parameters across all workers, otherwise trainer will call _init_params inside allreduce_grads (see mxnet/python/mxnet/gluon/trainer.py), thus the synchronization of the parameters actually happen after the first forward/backward computation, meaning that in the first iteration the gradients on different workers are computed w.r.t. different parameters. But in practice this may not be a severe problem, as only the first gradient descent step is not totally correct.
    When the model in gluon-nlp confirms to the new coding standard (removing defer_init for all parameters @sxjscience ), then we can directly call trainer._init_params after model.initialize.
    Actually I am a little confused about the semantic meaning of defer_init in the numpy mode, because shape with zeros (such as (0, 5)) in the numpy mode will be treated as scalar tensors' shape and zero-size tensors' shape, instead of as unknown in the legacy mode.

  • I found that we have to build MXNet and BytePS from source on our target machine. Using pip install mxnet (I have tried several wheels in August) will not work for BytePS, either core dump immediately after import bps.mxnet due to undefined symbols (seems because some symbols are not exported into binary, fixed after 0820 wheel), or mysterious segfault (maybe due to c++ ABI issues or other compiler issues, as the mxnet wheels and BytePS (which is setup on our target machine) may not be compiled using the same version of gcc. Besides, there are no related compiler flags such as D_GLIBCXX_USE_CXX11_ABI for MXNet in BytePS's setup.py).

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

cc @dmlc/gluon-nlp-team

@ZiyueHuang ZiyueHuang requested a review from a team as a code owner September 1, 2020 10:08
Copy link
Member

@zheyuye zheyuye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and building MXNet from source will work for distributed training

@codecov
Copy link

codecov bot commented Sep 1, 2020

Codecov Report

Merging #1334 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1334   +/-   ##
=======================================
  Coverage   67.34%   67.34%           
=======================================
  Files           2        2           
  Lines          98       98           
=======================================
  Hits           66       66           
  Misses         32       32           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c076093...ec793d5. Read the comment docs.

@ZiyueHuang
Copy link
Member Author

Just found that the API behavior in BytePS master branch changes recently...... Not sure if that is intended or bugs. Tracked here (bytedance/byteps#292).

@sxjscience
Copy link
Member

@ZiyueHuang You may merge the upstream/master since we recently fixed the CI.

@szha
Copy link
Member

szha commented Sep 1, 2020

undefined symbols

What are they?

@szhengac
Copy link
Member

szhengac commented Sep 1, 2020

Horovod also has similar issue that the actual broadcast happens after the first iteration. See here

@sxjscience
Copy link
Member

To avoid the problems of horovod, in GluonNLP v1, the convention is to not rely on deferred initialization when implementing the model. For example, for dense layer, we should always give the in_units.

@ZiyueHuang
Copy link
Member Author

ZiyueHuang commented Sep 3, 2020

Sorry for the late reply.

@szha core dump due to undefined symbols is fixed after 0820 wheel, and I didn't record the undefined symbols. For the segfault, below is the stack trace

Fatal Error: Segmentation fault
Stack trace:
Segmentation fault (core dumped)
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/bin/bpslaunch", line 4, in <module>
    __import__('pkg_resources').run_script('byteps==0.2.4', 'bpslaunch')
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1464, in run_script
    exec(code, namespace, namespace)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 213, in <module>
    launch_bps()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 206, in launch_bps
    t[i].join()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 34, in join
    raise self.exc
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 27, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 177, in worker
    stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/home/ubuntu/anaconda3/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python ./test.py' returned non-zero exit status 139.

@sxjscience @szhengac Horovod doesn't have similar issue, since in hvd.broadcast_parameters when there are defer_init parameters, it will register a hook to param.init_impl (see https://github.com/horovod/horovod/blob/9464e20ab2565f25105d5d7f6194ca90d6fca9e5/horovod/mxnet/__init__.py#L121-L127), thus immediately after the actual initialization, Horovod will broadcast the parameters, then forward/backward. However, there are no similar mechanisms in BytePS to handle the logic of defer_init.

@leezu
Copy link
Contributor

leezu commented Sep 4, 2020

Actually I am a little confused about the semantic meaning of defer_init in the numpy mode, because shape with zeros (such as (0, 5)) in the numpy mode will be treated as scalar tensors' shape and zero-size tensors' shape, instead of as unknown in the legacy mode.

You can report it as a bug in MXNet issue tracker

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants