[TT-MLIR Uplift] toHost API required to untilize the tensor #1176

Closed
jnie-TT opened this issue Feb 5, 2025 · 4 comments · Fixed by #1194

jnie-TT (Contributor) commented Feb 5, 2025

With change ttmlir#1744 on tt-mlir main, the output tensor layout is now dram-interleaved and tiled by default, so it is up to the user to untilize the tensor when needed. Prior to memcpying the runtime tensor, we therefore need to call the toHost API to untilize it:

Tensor toHost(Tensor tensor, bool untilize = false);
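For illustration, a minimal consumer-side sketch of the new flow follows. Only the toHost signature above is taken from this issue; the tt::runtime namespace, the header path, the memcpy(void*, Tensor) overload, and the variable names are assumptions for the example.

#include "tt/runtime/runtime.h"  // header path assumed for this sketch

// Sketch: reading back a device output after the tt-mlir uplift.
void readOutput(tt::runtime::Tensor deviceOutput, void *userBuffer) {
  // Outputs are dram-interleaved and tiled by default, so move the tensor
  // to host and untilize it before touching its data.
  tt::runtime::Tensor hostOutput =
      tt::runtime::toHost(deviceOutput, /*untilize=*/true);

  // Only now is the row-major host tensor safe to memcpy into user memory.
  tt::runtime::memcpy(userBuffer, hostOutput);
}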
jnie-TT changed the title from "[MLIR Uplift] toHost API required to untilize the tensor" to "[TT-MLIR Uplift] toHost API required to untilize the tensor" on Feb 5, 2025
pilkicTT (Contributor) commented Feb 5, 2025

@jnie-TT Thanks for opening the issue; I'll work on it tomorrow. We still have some uplift issues to resolve from before tenstorrent/tt-mlir#1744. Once that is complete, I'll unblock this as well...

pilkicTT (Contributor) commented Feb 6, 2025

@jnie-TT It looks like this won't go down so easily... 😄

We are hitting OOM on the device for some of the model tests. For the one I was looking at, it fails on moving input %arg112 to the device (which is a conv weight).

With the new tt-mlir:

%arg112: tensor<1000x512x1x1xf32, #ttnn_layout28> {ttir.name = "fc.weight"}
#ttnn_layout28 = #ttnn.ttnn_layout<(d0, d1, d2, d3) -> (d0 * 512 + d1 + d2, d3), <1x1>, memref<16000x1x!tt.tile<32x32, f32>, #dram>, <interleaved>>

Old one:

%arg112: tensor<1000x512x1x1xf32, #ttnn_layout29> {ttir.name = "fc.weight"}
#ttnn_layout29 = #ttnn.ttnn_layout<(d0, d1, d2, d3) -> (d0 * 512 + d1 + d2, d3), <1x1>, memref<512000x1xf32, #system_memory>>
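For a rough sense of scale from the two layouts above: the old host layout stores 512,000 f32 values (about 2 MB), while the tiled DRAM layout stores 16,000 tiles of 32x32 f32 values (about 65.5 MB), roughly a 32x increase for this single weight.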

Note: in the program, this weight tensor is moved back to the host and then passed into conv2d.

Here are the generated MLIRs:

baseline.txt
reg.txt

jnie-TT (Contributor, Author) commented Feb 6, 2025

@pilkicTT Yes, we are seeing the same on tt-torch; sorry I didn't notify you earlier. I'm currently adding a patch that overrides the input layout of conv weights to host (since they need to be on host initially anyway), and hopefully that fixes it. The issue is that tilizing these weights will increase their size by 1024x in many cases (a 32x32 tile holds 1024 elements, so tiny dimensions get padded out to full tiles).

Hopefully this will fix most issues, but if other errors pop up it might also be that the model just doesn't fit on a single device (we're seeing this for llama 7b on tt-torch). I'll keep you posted; thanks for looking into this!

kmabeeTT (Contributor) commented Feb 6, 2025

Datapoint: Predrag ran with Jackson's cherry-pick, which was merged to tt-mlir a few hours ago at tenstorrent/tt-mlir@6578419, and it solved the OOM errors; CI passed. I re-kicked off the uplift PR pointing to that commit alongside the needed memcpy change, which is now running, and enabled auto-merge, which will close this ticket if it merges.

kmabeeTT pushed a commit that referenced this issue Feb 7, 2025
- Use 6578419 for the default conv weights on host fix
- This uplift contains a change which sets new defaults for inputs/outputs (tilized and on device) and needs proper handling on our side.
- The outputs now need to be moved to the host before issuing the `memcpy` call.

Closes #1176