Perform padding for multi-device tensors to allow for a homogeneous TensorSpec across all devices #17476
Comments
As a first step, I am running a check to see how many of our existing tests and models rely on uneven multi-device sharding.
Setting as P0 for now during the investigation phase. It is likely this won't be a blocker, and we will find a workaround that trades off perf for generality.
I ran the T3K and TG/TGG test suites with the assertion on. There is some noise; it seems that only the falcon 7b demo on T3K runs into it. FYI @cfjchu
Falcon7b is known to have uneven shapes, so this makes sense. Nice to know this was the only model affected. Any notable performance regressions?
I just added a single
We had an offline conversation with @tt-asaigal @ayerofieiev-tt @jvegaTT @TT-BrianLiu @cfjchu. Padding is challenging, as it requires special handling, e.g. in reduction operations. Instead, we would like to invest in supporting heterogeneous runtime args - this approach is similar to how single-device sharding is handled. Until that is fully supported by TT distributed, we can assume the same tensor specs across devices. Please add any missing context!
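For illustration, here is a minimal NumPy sketch (not ttnn code; the shapes and the masking approach are assumptions for this example) of why padded shards need special handling: zero padding is harmless for elementwise ops, but it leaks into reductions over the padded dimension unless the op knows which elements are real.

```python
import numpy as np

# Hypothetical example: a device holds a padded shard of shape (1, 2, 32, 32),
# but only the first row along dim 1 is real data; the rest is zero padding.
real = np.ones((1, 1, 32, 32), dtype=np.float32)
padded = np.zeros((1, 2, 32, 32), dtype=np.float32)
padded[:, :1] = real

# Elementwise ops are unaffected by the zero padding.
assert np.allclose((padded + padded)[:, :1], real + real)

# Reductions over the padded dimension are affected: the padded zeros are
# counted as real elements, so the mean is halved.
print(real.mean(axis=1).mean())    # 1.0
print(padded.mean(axis=1).mean())  # 0.5 -- padding leaks into the result

# A correct padded reduction needs a mask (or an explicit element count),
# which is the kind of special handling mentioned above.
mask = np.zeros((1, 2, 1, 1), dtype=np.float32)
mask[:, :1] = 1.0
masked_mean = (padded * mask).sum(axis=1, keepdims=True) / mask.sum(axis=1, keepdims=True)
assert np.allclose(masked_mean, real.mean(axis=1, keepdims=True))
```

Heterogeneous runtime args sidestep this by letting each device operate on its true shard shape rather than on a padded one.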
Thanks for flagging this @omilyutin-tt, this is important. I discussed this with @cfjchu and I think we should consider the data-movement and compute logic separately. Here are the options we discussed:
For either option, the op infra changes are identical. I personally prefer option 2, since it can make models work out of the box as-is. Long term, we would like all ops to be multi-device aware, i.e. replace
cc @davorchap
Existing TTNN infra allows for uneven multi-device tensor sharding. For example, sharding a tensor with shape `(1, 13, 32, 32)` across 8 devices results in the last shard being 2x smaller, with shape `(1, 1, 32, 32)` as opposed to `(1, 2, 32, 32)`. The addition operation `ttnn.add` and all of our ops dispatch infra work just fine.

This is a valid use case, but it won't be efficiently supported by the new distributed infrastructure. To enable homogeneous workloads that use the same runtime args across devices, we should pad tensors so that `TensorSpec` remains the same across all devices.
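As a rough sketch of the padding idea (plain NumPy, not ttnn; the ceil-division shard size is an assumption for this example), padding the sharded dimension up to a multiple of the device count makes every shard the same shape, so a single `TensorSpec` can describe all devices:

```python
import numpy as np

# Hypothetical example: shard a (1, 13, 32, 32) tensor along dim 1 across
# 8 devices, as in the description above. With a per-device shard size of
# ceil(13 / 8) = 2, the last non-empty shard only has 1 row left, so the
# per-device shapes (and therefore TensorSpecs) diverge.
tensor = np.ones((1, 13, 32, 32), dtype=np.float32)
num_devices = 8
shard_size = -(-tensor.shape[1] // num_devices)  # ceil division -> 2

uneven = [tensor[:, i * shard_size:(i + 1) * shard_size] for i in range(num_devices)]
print([s.shape[1] for s in uneven])  # [2, 2, 2, 2, 2, 2, 1, 0] -- uneven shards

# Padding dim 1 up to shard_size * num_devices (13 -> 16) makes every shard
# identical, so the same spec and runtime args can be used on all devices.
padded = np.pad(tensor, ((0, 0), (0, shard_size * num_devices - tensor.shape[1]),
                         (0, 0), (0, 0)))
even = [padded[:, i * shard_size:(i + 1) * shard_size] for i in range(num_devices)]
assert all(s.shape == (1, 2, 32, 32) for s in even)
```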