Enable Domain Parallelism with ShardTensor #784
base: main
Conversation
…simple DDP sharding
…ieces are WIP but this has basic functionality supported for creation and forward usage.
…t of the ops have been validated, all that remains is to wrap the na2d function call to ensure it will dispatch properly.
…ng in unbind op rules.
….ops.aten.convolution.default.
…s also a minor bug in the backward pass that got more pronounced with smaller data: grad inputs were failing to properly collect haloed gradients and add them on the edges. Now fixed.
…gnificant overhead. I'm implementing an option here to switch to peer-to-peer message passing, since it might benefit from stream utilization in layers like natten.na2d. It's currently a developer choice, not a user choice.
…gnificant functionality changes in this commit.
Add `scatter_tensor` function to enable an easier transition to shard tensor (see the sketch after this commit list). This function allows users to maintain data pipelines (on one rank) and easily scatter that data to a domain mesh.
But also, this adjusts the shard tensor mechanism for tracking shard info to use a dict instead of a list of tuples.
No real code changes applied here.
There appears to be one corner case in redistribute to fix. TBD. Tests for grad propagation are coming.
FSDP and modulus ShardTensor
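A rough sketch of how the `scatter_tensor` workflow above might look in practice; the import locations, the argument names (`global_src`, `mesh`, `placements`), and the use of the `Shard` placement are assumptions for illustration, not the confirmed interface added in this PR.

```python
# Rough sketch of the `scatter_tensor` workflow described in the commit above.
# Import paths and the scatter_tensor signature are assumptions, not confirmed API.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard  # lives under torch.distributed._tensor in older torch releases

from modulus.distributed import DistributedManager
from modulus.distributed import scatter_tensor  # hypothetical import location

DistributedManager.initialize()
dm = DistributedManager()

# One flat "domain" mesh spanning all ranks.
mesh = init_device_mesh(dm.device.type, (dm.world_size,), mesh_dim_names=("domain",))

# The data pipeline runs only on rank 0; the other ranks receive shards.
full_batch = torch.randn(1, 3, 4096, 4096, device=dm.device) if dm.rank == 0 else None

# Scatter from rank 0 across the mesh, sharding along the height dimension (dim 2).
sharded_batch = scatter_tensor(full_batch, global_src=0, mesh=mesh, placements=(Shard(2),))
```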
/blossom-ci
Overall, I think this is looking good, nice work! I started with the documentation and then focused on unit tests to see overall functionality, as well as the code changes. Aside from the minor comments added, my main flag is to make the unit testing more complete, but I don't think that should necessarily block merging. In particular, for ops that we support (conv or nat, currently), I think we should add unit tests for correctness compared to a non-sharded baseline for forward and backward passes (within some numerical tolerance, especially in the context of the neighborhood attention numerics we discovered).
Thanks for the review @pzharrington! I agree with you on the testing. Here are my thoughts:
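As a rough illustration of the kind of baseline-comparison test suggested above, here is a sketch that checks a sharded forward and backward pass against a single-device run within a tolerance. The `full_tensor()` gather (inherited from DTensor) and the way the sharded input is constructed are assumptions, not the helpers actually used in this PR's test suite.

```python
# Sketch of a forward/backward correctness check against a non-sharded baseline.
import torch

def assert_sharded_matches_baseline(module, x_single, x_sharded, rtol=1e-4, atol=1e-4):
    # Single-device reference forward/backward.
    x_ref = x_single.detach().clone().requires_grad_(True)
    y_ref = module(x_ref)
    y_ref.sum().backward()

    # Same module applied to the sharded input.
    x_dist = x_sharded.detach().clone().requires_grad_(True)
    y_dist = module(x_dist)
    y_dist.sum().backward()

    # Gather the distributed results to full tensors before comparing.
    assert torch.allclose(y_dist.full_tensor(), y_ref, rtol=rtol, atol=atol)
    assert torch.allclose(x_dist.grad.full_tensor(), x_ref.grad, rtol=rtol, atol=atol)
```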
/multi-gpu-ci
Modulus Pull Request
Description
This PR adds new capabilities to Modulus:
- `ShardTensor` is an extension to pytorch `DTensor` that enables uneven sharding of tensors across `DeviceMesh` objects. While some logical sharding constraints remain, this allows more dynamic and flexible operation on distributed input data, especially in cases where the input data shape and output data shape differ.
- `ShardTensor` also enables an ecosystem of operation extensions. Two major ones are included in this PR: convolutions (1D/2D/3D) and neighborhood attention. When the right components of modulus are imported, these operations (when performed on sharded tensors) will automatically compute halo regions and perform data transfers to enable results consistent with single device outputs (see the sketch below).
- Documentation for `ShardTensor`, as well as an example of integrating multiple levels of parallelism by combining shard tensor and pytorch `FSDP`.
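To make the intended usage concrete, here is a minimal sketch, reusing the hypothetical `scatter_tensor` helper and import paths from the sketch earlier on this page (none of which are confirmed public API): a standard `nn.Conv2d` applied to a sharded input, with the halo exchange handled by the registered extensions rather than by user code.

```python
# Minimal end-to-end sketch of the usage described above. Import locations and
# the scatter_tensor signature are assumptions, not the confirmed public API.
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard

from modulus.distributed import DistributedManager, scatter_tensor  # hypothetical exports

DistributedManager.initialize()
dm = DistributedManager()
mesh = init_device_mesh(dm.device.type, (dm.world_size,), mesh_dim_names=("spatial",))

# Rank 0 owns the data pipeline; the image is sharded along its height (dim 2).
x = torch.randn(1, 8, 1024, 1024, device=dm.device) if dm.rank == 0 else None
x_sharded = scatter_tensor(x, global_src=0, mesh=mesh, placements=(Shard(2),))

conv = nn.Conv2d(8, 16, kernel_size=3, padding=1).to(dm.device)

# With the sharded-op extensions imported, halo regions at shard boundaries are
# exchanged automatically, so this should match a single-device forward pass
# up to numerical tolerance.
y_sharded = conv(x_sharded)
y_sharded.sum().backward()
```

Per the description above, the same sharded model can additionally be wrapped with pytorch FSDP to combine multiple levels of parallelism, as in the example the PR adds.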
Checklist
Dependencies
Adds a dependency on `wrapt` for monkey-patching operations on sharded inputs.
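For context on the `wrapt` dependency, here is a simplified sketch of the monkey-patching pattern it enables: wrapping a torch op so that sharded inputs can be routed to a halo-aware implementation. The dispatch logic and the `sharded_conv2d` helper are illustrative stand-ins, not the code in this PR; only the `wrapt.wrap_function_wrapper` call itself is standard `wrapt` usage.

```python
# Simplified illustration of the monkey-patching pattern `wrapt` enables.
import wrapt
import torch

from modulus.distributed import ShardTensor           # hypothetical import location
from modulus.distributed.conv import sharded_conv2d   # hypothetical halo-aware op


def _conv2d_wrapper(wrapped, instance, args, kwargs):
    # Route sharded inputs to the halo-aware implementation; otherwise fall
    # through to the original torch op untouched.
    if any(isinstance(a, ShardTensor) for a in args):
        return sharded_conv2d(*args, **kwargs)
    return wrapped(*args, **kwargs)


# Patch torch.nn.functional.conv2d in place at import time.
wrapt.wrap_function_wrapper(torch.nn.functional, "conv2d", _conv2d_wrapper)
```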