Releases · tenstorrent/tt-metal

14 Jan 02:06

v0.54.0-rc23

4cfb561

v0.54.0-rc23 Pre-release

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with that release, not the versions on the main branch. There may be differences between the latest main and the previous release.

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12759327887

📦 Uncategorized

Isolate tracy
- PR: #16161
#5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
#14976/#15039: Add Support For ceil_mode=True
- PR: #16124
Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
New FD Init Flow
- PR: #15406
Add support for output sharded embeddings
- PR: #16237
Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
#0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
MeshDevice: Support Quanta Galaxy system file
- PR: #16239
Move Device members from public to private
- PR: #16256
Add unary sharded sweeps
- PR: #15300
#0: Added core_grid offset for sharded layernorm
- PR: #16207
fix abs path bug for sweeps tests code
- PR: #16285
#0: Publish TT-Distributed doc under tech_reports
- PR: #16261
#15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
#16265: Remove creation op
- PR: #16269
Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
Fix compile issue for earlier c++ versions
- PR: #16291
#0: Typo fix in TT distributed tech report
- PR: #16308
[Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
LLM Tech report section 4.4
- PR: #15166
Move some Device methods to private section
- PR: #16259
#0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
#15857: Binary Forge Sweep Tests Set1
- PR: #16042
#0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
#0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315

14 Jan 14:17

github-actions

v0.54.0

924f017

v0.54.0 Pre-release

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12768962484

📦 Uncategorized

Isolate tracy
- PR: #16161
#0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
MeshDevice: Support Quanta Galaxy system file
- PR: #16239
Move Device members from public to private
- PR: #16256
Add unary sharded sweeps
- PR: #15300
#0: Added core_grid offset for sharded layernorm
- PR: #16207
fix abs path bug for sweeps tests code
- PR: #16285
#0: Publish TT-Distributed doc under tech_reports
- PR: #16261
#15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
#16265: Remove creation op
- PR: #16269
Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
Fix compile issue for earlier c++ versions
- PR: #16291
#0: Typo fix in TT distributed tech report
- PR: #16308
[Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
LLM Tech report section 4.4
- PR: #15166
Move some Device methods to private section
- PR: #16259
#0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
#15857: Binary Forge Sweep Tests Set1
- PR: #16042
#0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
#0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315
#16066: Add seed param to uniform and bernoulli ops
- PR: #16179
#0: Add StrongType to help creating non-clashing alias types
- PR: #16309
#0: Fix ccl workers not starting
- PR: #16333
#15642: Replace shapes in eltwise
- PR: #15646
Remove old fd init code path
- PR: #16321
Remove more namespace pollution caused by using namespace tt::tt_metal in header file
- PR: #16342
#0: make dependent configs dependent
- PR: #16324
#13643: Extend binary-ng math support to match all primitive binary ops
- PR: #16276
Fix wrong output tensor shape for prod
- PR: #16334
Update CODEOWNERS
- PR: #16358
Add subdevice support to multicore untilize
- PR: #16193
add multi-iteration support to reduce scatter async
- PR: #16294
#16356: Program Dispatch Modifications for MeshWorkload
- PR: #16361
Refactor conv files using clang-format
- PR: #16340
#15338: Fix watcher using the wrong cmd bufs for addr sanitization when using dynamic noc
- PR: #16363
Add cluster-axis API support to reduce scatter
- PR: #16293
split ttnn unit tests 8 ways
- PR: #16382
split ttnn tests into 10 groups
- PR: #16383
#0: Fixes for remote circular buffer synchronization
- PR: #16378
#0: Initial tech report for Sub-Device feature
- PR: #16387
Adapt to tt-system-tools hugepages configuration
- PR: #14396
Further removal of Shape/LegacyShape in order to allow 0D/1D tensors
- PR: #16337
#16134: add test cases for pre-allocated CreateBuffer / ttnn::event_query
- PR: #16135
setting multi-core for tilize with padding
- PR: #16252
reshape assert fix
- PR: #16300
#16165: Add binary SFPU divide init function
- PR: #16250
#15879: supported subcoregrid for createqkv heads
- PR: #15972
Reimplemented dropout as separate op.
- PR: #16328
#16356: Reland Program Dispatch Modifications for MeshWorkload
- PR: #16385
support all dim lengths for reduction
- PR: #16247
Check that writes don't go to below the ringbuffer
- PR: #16399
#16390: Move reduce_scatter_async into experimental namespace and enable cluster api tests
- PR: #16407
Typecast in ng
- PR: #16317
Speed up linking for incremental builds.
- PR: #15994
#0: Don't return shared ptrs of global sems/cbs, and directly return the object instead
- PR: #16354
Add support for act_block_h_override to Width Sharded Conv2d
- PR: #16374
#0: Fix CMakeLists
- PR: #16417
Update install_dependencies.sh to install hugepages using tt-system-tools hugepages service
- PR: #15953
delete stale/(now) invalid assert after recent update to use virtual …
- PR: #16313
Fix CB Overflow issue on certain transposes and permutes
- PR: #16155
Removing LegacyShape from Tensor::pad
- PR: #16424
Add experimental APIs to access Hal
- PR: #16426
Remove documentation references to "setup_hugepages.py"
- PR: #16428
#16175: Add DPRINT TileSlice support for int types
- PR: #16413
Fix remaining minor input/output issues with TG-Llama3 vLLM integration
- PR: #16437
#0: Reshuffle some logic in resize_remote_sender/receiver_cb_interface to fix perf degradation in some models
- PR: #16436
Move conv specific ops from tensor_utils to conv2d.
- PR: #16373
Support all ND shapes for tilize/untilize
- PR: #16299
Remove unused ARCH_NAME specific includes "eth_l1_address_map.h"
- PR: #16445
#0: Fix failing test case for width sharded non-32 multiple output width
- PR: #16224
#15605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16401
#16339: parameterize dispatch_constants
- PR: #16355
Ucheema/tt fabric arch md
- PR: #16456
Add ttnn.conv2d unit tests for UNet Shallow at groups=4,6,8
- PR: #16452
Pad greater than 4D
- PR: #16453
[tt-train] Memory efficient option to run GPT2
- PR: #16205
#15732: add matmul block h/w parameter processing
- PR: #15938
#0: Enable unity for sublibraries
- PR: #16450
Remove redundant function determine_parallel_config_non_tile_mul_width.
- PR: #15955
Add support for tiled indices via padding/alignment aware embedding kernel (tiled indices only)
- PR: #16296
Bw sharded sweeps: neg_bw, log_bw, relu_bw, relu6_bw, leaky_relu_bw, rsqrt_bw
- PR: #16344
Conv2dConfig reallocate_halo_output default to true
- PR: #16185
[Llama3] Change prefill padding in LlamaGenerator to nearest 2048 and optimize chunked prefill readback
- PR: #16472
Added check for global non-constexpr uint64_t value in kernel
- PR: #16476
Update CONTRIBUTING.md
- PR: #16475
Dedicated target for HostDevCommon
- PR: #16493
Fix bug when calling CreateDevice in a loop on TG
- PR: #16260
Fix cb allocation errors for halo and conv2d
- PR: #16190
The library is the authority on include dir locations, not the consumers
- PR: #16164
#0: fix corerange handling in ROPE
- PR: #16444
undo revert of #16247
- PR: #16430
#16495: update test pccs after matmul changes and skip test with ND PCC failure
- PR: #16498
Reserve vector in cluster function
- PR: #16507
Xuncai/flash decode bugfix
- PR: #16362

13 Jan 02:08

github-actions

v0.54.0-rc22

1a7e545

v0.54.0-rc22 Pre-release

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12739066348

📦 Uncategorized

Add buffering to DPRINT
- PR: #15677
Isolate tracy
- PR: #16161
#16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
#15221: Post completion messages to dispatch_s
- PR: #16187
[TT-Train] Added softmax backward
- PR: #16168
Optimized FreeList allocator
- PR: #15536
Set the test data to be relative to the test binary
- PR: #16150
#0: Fix matmul doc string
- PR: #16208
#0: remove spammy warning from conftest
- PR: #16198
Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
LLM tech report sections 2.2, 2.5
- PR: #15121
[TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
Updating Allocator docs to explain first fit usage
- PR: #16214
Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
#15891: improve full accuracy and fix full bugs
- PR: #16182
Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
#15857: Skip abs forge for GS
- PR: #16221
#16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
Add max kernel size for each risc type in an op
- PR: #16203
Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
#12662: add keepdim fixes to reduce
- PR: #16163
Add chunked prefill to Llama family
- PR: #16111
#15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
Update CODEOWNERS
- PR: #16196
support reduction for 3d & 4d dims
- PR: #16236
#5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
Add full support for creating tensors with logical sharding from python
- PR: #16072
update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
#15857: Binary Forge Sweep Tests Set2
- PR: #16087
#14976/#15039: Add Support For ceil_mode=True
- PR: #16124
Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
New FD Init Flow
- PR: #15406
Add support for output sharded embeddings
- PR: #16237
Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
#0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
MeshDevice: Support Quanta Galaxy system file
- PR: #16239
Move Device members from public to private
- PR: #16256
Add unary sharded sweeps
- PR: #15300
#0: Added core_grid offset for sharded layernorm
- PR: #16207
fix abs path bug for sweeps tests code
- PR: #16285
#0: Publish TT-Distributed doc under tech_reports
- PR: #16261
#15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
#16265: Remove creation op
- PR: #16269
Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
Fix compile issue for earlier c++ versions
- PR: #16291
#0: Typo fix in TT distributed tech report
- PR: #16308
[Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
LLM Tech report section 4.4
- PR: #15166
Move some Device methods to private section
- PR: #16259
#0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
#15857: Binary Forge Sweep Tests Set1
- PR: #16042
#0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
#0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315

11 Jan 02:07

github-actions

v0.54.0-rc21

ca2c867

v0.54.0-rc21 Pre-release

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12719934369

📦 Uncategorized

Add buffering to DPRINT
- PR: #15677
Clean-up the usage of deallocate_activation
- PR: #16099
llm tech report multi device section
- PR: #16180
Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
#0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
[LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
#16165: Disabling test that depends on some machine state to pass
- PR: #16166
enable dps ops for matmul
- PR: #15285
Isolate tracy
- PR: #16161
[TT-Train] Added tests for sum and mean
- PR: #16152
#16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
#15221: Post completion messages to dispatch_s
- PR: #16187
[TT-Train] Added softmax backward
- PR: #16168
Optimized FreeList allocator
- PR: #15536
Set the test data to be relative to the test binary
- PR: #16150
#0: Fix matmul doc string
- PR: #16208
#0: remove spammy warning from conftest
- PR: #16198
Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
LLM tech report sections 2.2, 2.5
- PR: #15121
[TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
Updating Allocator docs to explain first fit usage
- PR: #16214
Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
#15891: improve full accuracy and fix full bugs
- PR: #16182
Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
#15857: Skip abs forge for GS
- PR: #16221
#16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
Add max kernel size for each risc type in an op
- PR: #16203
Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
#12662: add keepdim fixes to reduce
- PR: #16163
Add chunked prefill to Llama family
- PR: #16111
#15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
Update CODEOWNERS
- PR: #16196
support reduction for 3d & 4d dims
- PR: #16236
#5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
Add full support for creating tensors with logical sharding from python
- PR: #16072
update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
#15857: Binary Forge Sweep Tests Set2
- PR: #16087
#14976/#15039: Add Support For ceil_mode=True
- PR: #16124
Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
New FD Init Flow
- PR: #15406
Add support for output sharded embeddings
- PR: #16237
Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
#0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
MeshDevice: Support Quanta Galaxy system file
- PR: #16239
Move Device members from public to private
- PR: #16256
Add unary sharded sweeps
- PR: #15300
#0: Added core_grid offset for sharded layernorm
- PR: #16207
fix abs path bug for sweeps tests code
- PR: #16285
#0: Publish TT-Distributed doc under tech_reports
- PR: #16261
#15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
#16265: Remove creation op
- PR: #16269
Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
Fix compile issue for earlier c++ versions
- PR: #16291
#0: Typo fix in TT distributed tech report
- PR: #16308
[Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
LLM Tech report section 4.4
- PR: #15166
Move some Device methods to private section
- PR: #16259
#0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
#15857: Binary Forge Sweep Tests Set1
- PR: #16042
#0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
#0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315

10 Jan 02:07

github-actions

v0.54.0-rc20

14dac66

v0.54.0-rc20 Pre-release

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12701599118

📦 Uncategorized

Add buffering to DPRINT
- PR: #15677
Python -> Python3
- PR: #16063
#15713 Bad Eltwise Binary ZEROACC
- PR: #16094
#15565 Fix unit test to show sharding ttnn.from_torch problems
- PR: #16088
Fix paged SDPA decode CB sizing issue
- PR: #16059
Reland async dispatch with workaround for hang.
- PR: #16121
#16119: Add forge traces to matmul and reduce sweeps
- PR: #16139
#10034: Binary shift operators
- PR: #16055
#0: Remove incorrect memory span assert
- PR: #16136
Add forge sweeps for slice and transpose
- PR: #16112
#0: Move memory config serialization in the corresponding header away from types.hpp
- PR: #16151
#16114: Allow Binarized Programs to be Reused across WH Devices
- PR: #16120
#0: aligning conv2d transpose as conv
- PR: #16128
support missing cases for sweep tests
- PR: #15804
#0: added normalization details in the tech report
- PR: #15124
Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #15882
Port all Moreh OPs to compute_output_specs
- PR: #16160
Bump umd to fix grayskull cluster bug
- PR: #16126
Clean-up the usage of deallocate_activation
- PR: #16099
llm tech report multi device section
- PR: #16180
Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
#0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
[LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
#16165: Disabling test that depends on some machine state to pass
- PR: #16166
enable dps ops for matmul
- PR: #15285
Isolate tracy
- PR: #16161
[TT-Train] Added tests for sum and mean
- PR: #16152
#16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
#15221: Post completion messages to dispatch_s
- PR: #16187
[TT-Train] Added softmax backward
- PR: #16168
Optimized FreeList allocator
- PR: #15536
Set the test data to be relative to the test binary
- PR: #16150
#0: Fix matmul doc string
- PR: #16208
#0: remove spammy warning from conftest
- PR: #16198
Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
LLM tech report sections 2.2, 2.5
- PR: #15121
[TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
Updating Allocator docs to explain first fit usage
- PR: #16214
Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
#15891: improve full accuracy and fix full bugs
- PR: #16182
Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
#15857: Skip abs forge for GS
- PR: #16221
#16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
Add max kernel size for each risc type in an op
- PR: #16203
Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
#12662: add keepdim fixes to reduce
- PR: #16163
Add chunked prefill to Llama family
- PR: #16111
#15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
Update CODEOWNERS
- PR: #16196
support reduction for 3d & 4d dims
- PR: #16236
#5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
Add full support for creating tensors with logical sharding from python
- PR: #16072
update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
#15857: Binary Forge Sweep Tests Set2
- PR: #16087
#14976/#15039: Add Support For ceil_mode=True
- PR: #16124
Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
New FD Init Flow
- PR: #15406
Add support for output sharded embeddings
- PR: #16237
Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
#0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
MeshDevice: Support Quanta Galaxy system file
- PR: #16239
Move Device members from public to private
- PR: #16256
Add unary sharded sweeps
- PR: #15300
#0: Added core_grid offset for sharded layernorm
- PR: #16207
fix abs path bug for sweeps tests code
- PR: #16285
#0: Publish TT-Distributed doc under tech_reports
- PR: #16261
#15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
#16265: Remove creation op
- PR: #16269
Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
Fix compile issue for earlier c++ versions
- PR: #16291
#0: Typo fix in TT distributed tech report
- PR: #16308
[Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
LLM Tech report section 4.4
- PR: #15166
Move some Device methods to private section
- PR: #16259
#0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
#15857: Binary Forge Sweep Tests Set1
- PR: #16042
#0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
#0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315

08 Jan 02:06

github-actions

v0.54.0-rc19

924f017

v0.54.0-rc19 Pre-release

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12662398466

📦 Uncategorized

Add buffering to DPRINT
- PR: #15677
Python -> Python3
- PR: #16063
#0: separate validation of conv weight and bias.
- PR: #15990
#0: Minor refactor of pytensor and tensor implementation files
- PR: #16108
C++ files should not be part of the API of a library
- PR: #16123
#15857: Forge sweep test
- PR: #15858
#15857: Unary forge sweep tests
- PR: #15901
Fix some more namespace pollution caused by using namespace tt::tt_metal
- PR: #16090
#15713 Bad Eltwise Binary ZEROACC
- PR: #16094
#15565 Fix unit test to show sharding ttnn.from_torch problems
- PR: #16088
Fix paged SDPA decode CB sizing issue
- PR: #16059
Reland async dispatch with workaround for hang.
- PR: #16121
#16119: Add forge traces to matmul and reduce sweeps
- PR: #16139
#10034: Binary shift operators
- PR: #16055
#0: Remove incorrect memory span assert
- PR: #16136
Add forge sweeps for slice and transpose
- PR: #16112
#0: Move memory config serialization in the corresponding header away from types.hpp
- PR: #16151
#16114: Allow Binarized Programs to be Reused across WH Devices
- PR: #16120
#0: aligning conv2d transpose as conv
- PR: #16128
support missing cases for sweep tests
- PR: #15804
#0: added normalization details in the tech report
- PR: #15124
Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #15882
Port all Moreh OPs to compute_output_specs
- PR: #16160
Bump umd to fix grayskull cluster bug
- PR: #16126
Clean-up the usage of deallocate_activation
- PR: #16099
llm tech report multi device section
- PR: #16180
Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
#0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
[LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
#16165: Disabling test that depends on some machine state to pass
- PR: #16166
enable dps ops for matmul
- PR: #15285
Isolate tracy
- PR: #16161
[TT-Train] Added tests for sum and mean
- PR: #16152
#16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
#15221: Post completion messages to dispatch_s
- PR: #16187
[TT-Train] Added softmax backward
- PR: #16168
Optimized FreeList allocator
- PR: #15536
Set the test data to be relative to the test binary
- PR: #16150
#0: Fix matmul doc string
- PR: #16208
#0: remove spammy warning from conftest
- PR: #16198
Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
LLM tech report sections 2.2, 2.5
- PR: #15121
[TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
Updating Allocator docs to explain first fit usage
- PR: #16214
Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
#15891: improve full accuracy and fix full bugs
- PR: #16182
Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
#15857: Skip abs forge for GS
- PR: #16221
#16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
Add max kernel size for each risc type in an op
- PR: #16203
Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
#12662: add keepdim fixes to reduce
- PR: #16163
Add chunked prefill to Llama family
- PR: #16111
#15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
Update CODEOWNERS
- PR: #16196
support reduction for 3d & 4d dims
- PR: #16236
#5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
Add full support for creating tensors with logical sharding from python
- PR: #16072
update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
#15857: Binary Forge Sweep Tests Set2
- PR: #16087
#14976/#15039: Add Support For ceil_mode=True
- PR: #16124
Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
New FD Init Flow
- PR: #15406
Add support for output sharded embeddings
- PR: #16237
Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
#0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
MeshDevice: Support Quanta Galaxy system file
- PR: #16239
Move Device members from public to private
- PR: #16256
Add unary sharded sweeps
- PR: #15300
#0: Added core_grid offset for sharded layernorm
- PR: #16207
fix abs path bug for sweeps tests code
- PR: #16285
#0: Publish TT-Distributed doc under tech_reports
- PR: #16261
#15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
#16265: Remove creation op
- PR: #16269
Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
Fix compile issue for earlier c++ versions
- PR: #16291
#0: Typo fix in TT distributed tech report
- PR: #16308
[Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
LLM Tech report section 4.4
- PR: #15166
Move some Device methods to private section
- PR: #16259
#0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
#15857: Binary Forge Sweep Tests Set1
- PR: #16042
#0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
#0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315

07 Jan 02:28

github-actions

v0.54.0-rc18

bf94433

v0.54.0-rc18 Pre-release

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12643496109

📦 Uncategorized

Add buffering to DPRINT
- PR: #15677
#0: Remove some dead code
- PR: #16084
Updated installation script
- PR: #16101
Python -> Python3
- PR: #16063
Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
- PR: #15881
Adding ND support for tilize/untilize with padding
- PR: #15933
[Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
- PR: #16076
#0: Enable Local Sweeps and Use a Faster Interprocess Queue
- PR: #16098
#15601: Implement support for MeshDevice::reshape(..)
- PR: #16029
Remove setup_core_to_tlb_map
- PR: #16048
#0: Let sharded_to_interleaved handle interleaved input
- PR: #16116
#0: separate validation of conv weight and bias.
- PR: #15990
#0: Minor refactor of pytensor and tensor implementation files
- PR: #16108
C++ files should not be part of the API of a library
- PR: #16123
#15857: Forge sweep test
- PR: #15858
#15857: Unary forge sweep tests
- PR: #15901
Fix some more namespace pollution caused by using namespace tt::tt_metal
- PR: #16090
#15713 Bad Eltwise Binary ZEROACC
- PR: #16094
#15565 Fix unit test to show sharding ttnn.from_torch problems
- PR: #16088
Fix paged SDPA decode CB sizing issue
- PR: #16059
Reland async dispatch with workaround for hang.
- PR: #16121
#16119: Add forge traces to matmul and reduce sweeps
- PR: #16139
#10034: Binary shift operators
- PR: #16055
#0: Remove incorrect memory span assert
- PR: #16136
Add forge sweeps for slice and transpose
- PR: #16112
#0: Move memory config serialization in the corresponding header away from types.hpp
- PR: #16151
#16114: Allow Binarized Programs to be Reused across WH Devices
- PR: #16120
#0: aligning conv2d transpose as conv
- PR: #16128
support missing cases for sweep tests
- PR: #15804
#0: added normalization details in the tech report
- PR: #15124
Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #15882
Port all Moreh OPs to compute_output_specs
- PR: #16160
Bump umd to fix grayskull cluster bug
- PR: #16126
Clean-up the usage of deallocate_activation
- PR: #16099
llm tech report multi device section
- PR: #16180
Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
#0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
[LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
#16165: Disabling test that depends on some machine state to pass
- PR: #16166
enable dps ops for matmul
- PR: #15285
Isolate tracy
- PR: #16161
[TT-Train] Added tests for sum and mean
- PR: #16152
#16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
#15221: Post completion messages to dispatch_s
- PR: #16187
[TT-Train] Added softmax backward
- PR: #16168
Optimized FreeList allocator
- PR: #15536
Set the test data to be relative to the test binary
- PR: #16150
#0: Fix matmul doc string
- PR: #16208
#0: remove spammy warning from conftest
- PR: #16198
Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
LLM tech report sections 2.2, 2.5
- PR: #15121
[TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
Updating Allocator docs to explain first fit usage
- PR: #16214
Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
#15891: improve full accuracy and fix full bugs
- PR: #16182
Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
#15857: Skip abs forge for GS
- PR: #16221
#16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
Add max kernel size for each risc type in an op
- PR: #16203
Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
#12662: add keepdim fixes to reduce
- PR: #16163
Add chunked prefill to Llama family
- PR: #16111
#15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
Update CODEOWNERS
- PR: #16196
support reduction for 3d & 4d dims
- PR: #16236
#5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
Add full support for creating tensors with logical sharding from python
- PR: #16072
update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
#15857: Binary Forge Sweep Tests Set2
- PR: #16087
#14976/#15039: Add Support For ceil_mode=True
- PR: #16124
Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
New FD Init Flow
- PR: #15406
Add support for output sharded embeddings
- PR: #16237
Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
#0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
MeshDevice: Support Quanta Galaxy system file
- PR: #16239
Move Device members from public to private
- PR: #16256
Add unary sharded sweeps
- PR: #15300
#0: Added core_grid offset for sharded layernorm
- PR: #16207
fix abs path bug for sweeps tests code
- PR: #16285
#0: Publish TT-Distributed doc under tech_reports
- PR: #16261
#15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
#16265: Remove creation op
- PR: #16269
Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
Fix compile issue for earlier c++ versions
- PR: #16291
#0: Typo fix in TT distributed tech report
- PR: #16308
[Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
LLM Tech report section 4.4
- PR: #15166
Move some Device methods to private section
- PR: #16259
#0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
#15857: Binary Forge Sweep Tests Set1
- PR: #16042
#0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
#0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315

06 Jan 02:30

github-actions

v0.54.0-rc17

cb02e39

v0.54.0-rc17 Pre-release

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12624900279

📦 Uncategorized

Add buffering to DPRINT
- PR: #15677
#0: Remove some dead code
- PR: #16084
Updated installation script
- PR: #16101
Python -> Python3
- PR: #16063
Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
- PR: #15881
Adding ND support for tilize/untilize with padding
- PR: #15933
[Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
- PR: #16076
#0: Enable Local Sweeps and Use a Faster Interprocess Queue
- PR: #16098
#15601: Implement support for MeshDevice::reshape(..)
- PR: #16029
Remove setup_core_to_tlb_map
- PR: #16048
#0: Let sharded_to_interleaved handle interleaved input
- PR: #16116
#0: separate validation of conv weight and bias.
- PR: #15990
#0: Minor refactor of pytensor and tensor implementation files
- PR: #16108
C++ files should not be part of the API of a library
- PR: #16123
#15857: Forge sweep test
- PR: #15858
#15857: Unary forge sweep tests
- PR: #15901
Fix some more namespace pollution caused by using namespace tt::tt_metal
- PR: #16090
#15713 Bad Eltwise Binary ZEROACC
- PR: #16094
#15565 Fix unit test to show sharding ttnn.from_torch problems
- PR: #16088
Fix paged SDPA decode CB sizing issue
- PR: #16059
Reland async dispatch with workaround for hang.
- PR: #16121
#16119: Add forge traces to matmul and reduce sweeps
- PR: #16139
#10034: Binary shift operators
- PR: #16055
#0: Remove incorrect memory span assert
- PR: #16136
Add forge sweeps for slice and transpose
- PR: #16112
#0: Move memory config serialization in the corresponding header away from types.hpp
- PR: #16151
#16114: Allow Binarized Programs to be Reused across WH Devices
- PR: #16120
#0: aligning conv2d transpose as conv
- PR: #16128
support missing cases for sweep tests
- PR: #15804
#0: added normalization details in the tech report
- PR: #15124
Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #15882
Port all Moreh OPs to compute_output_specs
- PR: #16160
Bump umd to fix grayskull cluster bug
- PR: #16126
Clean-up the usage of deallocate_activation
- PR: #16099
llm tech report multi device section
- PR: #16180
Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
#0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
[LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
#16165: Disabling test that depends on some machine state to pass
- PR: #16166
enable dps ops for matmul
- PR: #15285
Isolate tracy
- PR: #16161
[TT-Train] Added tests for sum and mean
- PR: #16152
#16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
#15221: Post completion messages to dispatch_s
- PR: #16187
[TT-Train] Added softmax backward
- PR: #16168
Optimized FreeList allocator
- PR: #15536
Set the test data to be relative to the test binary
- PR: #16150
#0: Fix matmul doc string
- PR: #16208
#0: remove spammy warning from conftest
- PR: #16198
Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
LLM tech report sections 2.2, 2.5
- PR: #15121
[TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
Updating Allocator docs to explain first fit usage
- PR: #16214
Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
#15891: improve full accuracy and fix full bugs
- PR: #16182
Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
#15857: Skip abs forge for GS
- PR: #16221
#16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
Add max kernel size for each risc type in an op
- PR: #16203
Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
#12662: add keepdim fixes to reduce
- PR: #16163
Add chunked prefill to Llama family
- PR: #16111
#15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
Update CODEOWNERS
- PR: #16196
support reduction for 3d & 4d dims
- PR: #16236
#5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
Add full support for creating tensors with logical sharding from python
- PR: #16072
update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
#15857: Binary Forge Sweep Tests Set2
- PR: #16087
#14976/#15039: Add Support For ceil_mode=True
- PR: #16124
Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
New FD Init Flow
- PR: #15406
Add support for output sharded embeddings
- PR: #16237
Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
#0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
MeshDevice: Support Quanta Galaxy system file
- PR: #16239
Move Device members from public to private
- PR: #16256
Add unary sharded sweeps
- PR: #15300
#0: Added core_grid offset for sharded layernorm
- PR: #16207
fix abs path bug for sweeps tests code
- PR: #16285
#0: Publish TT-Distributed doc under tech_reports
- PR: #16261
#15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
#16265: Remove creation op
- PR: #16269
Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
Fix compile issue for earlier c++ versions
- PR: #16291
#0: Typo fix in TT distributed tech report
- PR: #16308
[Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
LLM Tech report section 4.4
- PR: #15166
Move some Device methods to private section
- PR: #16259
#0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
#15857: Binary Forge Sweep Tests Set1
- PR: #16042
#0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
#0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315

04 Jan 02:28

github-actions

v0.54.0-rc16

fe6a3da

v0.54.0-rc16 Pre-release

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12606309953

📦 Uncategorized

Add buffering to DPRINT
- PR: #15677
Revert "#15565 Add unit test to show sharding ttnn.from_torch problems"
- PR: #16086
[UMD] Removed set_*_params calls and constants
- PR: #15908
#0: Remove some dead code
- PR: #16084
Updated installation script
- PR: #16101
Python -> Python3
- PR: #16063
Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
- PR: #15881
Adding ND support for tilize/untilize with padding
- PR: #15933
[Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
- PR: #16076
#0: Enable Local Sweeps and Use a Faster Interprocess Queue
- PR: #16098
#15601: Implement support for MeshDevice::reshape(..)
- PR: #16029
Remove setup_core_to_tlb_map
- PR: #16048
#0: Let sharded_to_interleaved handle interleaved input
- PR: #16116
#0: separate validation of conv weight and bias.
- PR: #15990
#0: Minor refactor of pytensor and tensor implementation files
- PR: #16108
C++ files should not be part of the API of a library
- PR: #16123
#15857: Forge sweep test
- PR: #15858
#15857: Unary forge sweep tests
- PR: #15901
Fix some more namespace pollution caused by using namespace tt::tt_metal
- PR: #16090
#15713 Bad Eltwise Binary ZEROACC
- PR: #16094
#15565 Fix unit test to show sharding ttnn.from_torch problems
- PR: #16088
Fix paged SDPA decode CB sizing issue
- PR: #16059
Reland async dispatch with workaround for hang.
- PR: #16121
#16119: Add forge traces to matmul and reduce sweeps
- PR: #16139
#10034: Binary shift operators
- PR: #16055
#0: Remove incorrect memory span assert
- PR: #16136
Add forge sweeps for slice and transpose
- PR: #16112
#0: Move memory config serialization in the corresponding header away from types.hpp
- PR: #16151
#16114: Allow Binarized Programs to be Reused across WH Devices
- PR: #16120
#0: aligning conv2d transpose as conv
- PR: #16128
support missing cases for sweep tests
- PR: #15804
#0: added normalization details in the tech report
- PR: #15124
Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #15882
Port all Moreh OPs to compute_output_specs
- PR: #16160
Bump umd to fix grayskull cluster bug
- PR: #16126
Clean-up the usage of deallocate_activation
- PR: #16099
llm tech report multi device section
- PR: #16180
Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
#0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
[LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
#16165: Disabling test that depends on some machine state to pass
- PR: #16166
enable dps ops for matmul
- PR: #15285
Isolate tracy
- PR: #16161
[TT-Train] Added tests for sum and mean
- PR: #16152
#16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
#15221: Post completion messages to dispatch_s
- PR: #16187
[TT-Train] Added softmax backward
- PR: #16168
Optimized FreeList allocator
- PR: #15536
Set the test data to be relative to the test binary
- PR: #16150
#0: Fix matmul doc string
- PR: #16208
#0: remove spammy warning from conftest
- PR: #16198
Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
LLM tech report sections 2.2, 2.5
- PR: #15121
[TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
Updating Allocator docs to explain first fit usage
- PR: #16214
Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
#15891: improve full accuracy and fix full bugs
- PR: #16182
Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
#15857: Skip abs forge for GS
- PR: #16221
#16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
Add max kernel size for each risc type in an op
- PR: #16203
Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
#12662: add keepdim fixes to reduce
- PR: #16163
Add chunked prefill to Llama family
- PR: #16111
#15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
Update CODEOWNERS
- PR: #16196
support reduction for 3d & 4d dims
- PR: #16236
#5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
Add full support for creating tensors with logical sharding from python
- PR: #16072
update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
#15857: Binary Forge Sweep Tests Set2
- PR: #16087
#14976/#15039: Add Support For ceil_mode=True
- PR: #16124
Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
New FD Init Flow
- PR: #15406
Add support for output sharded embeddings
- PR: #16237
Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
#0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
MeshDevice: Support Quanta Galaxy system file
- PR: #16239
Move Device members from public to private
- PR: #16256
Add unary sharded sweeps
- PR: #15300
#0: Added core_grid offset for sharded layernorm
- PR: #16207
fix abs path bug for sweeps tests code
- PR: #16285
#0: Publish TT-Distributed doc under tech_reports
- PR: #16261
#15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
#16265: Remove creation op
- PR: #16269
Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
Fix compile issue for earlier c++ versions
- PR: #16291
#0: Typo fix in TT distributed tech report
- PR: #16308
[Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
LLM Tech report section 4.4
- PR: #15166
Move some Device methods to private section
- PR: #16259
#0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
#15857: Binary Forge Sweep Tests Set1
- PR: #16042
#0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
#0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315

03 Jan 02:04

github-actions

v0.54.0-rc15

d34a23e

v0.54.0-rc15 Pre-release

The changelog will now follow, showing the changes from last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12591326491

📦 Uncategorized

Add buffering to DPRINT
- PR: #15677
Revert "#0: Fix merge conflicts originating from #15289"
- PR: #16080
Revert "Link Tensor.reshape to ttnn.reshape"
- PR: #16081
#15061: Implement multi-device tensor distribution APIs in terms of C++ ttnn tensors
- PR: #15886
#0: Allow ttnn.pad to pad Tensor to an odd width in row major
- PR: #16079
#15565 Add unit test to show sharding ttnn.from_torch problems
- PR: #15827
#14977: conv config to use higher cores.
- PR: #15962
Revert "#15565 Add unit test to show sharding ttnn.from_torch problems"
- PR: #16086
[UMD] Removed set_*_params calls and constants
- PR: #15908
#0: Remove some dead code
- PR: #16084
Updated installation script
- PR: #16101
Python -> Python3
- PR: #16063
Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute
- PR: #15881
Adding ND support for tilize/untilize with padding
- PR: #15933
[Llama3.2-11b vLLM Integration] Add support for paged cross attention, fixes for continuous batching, simplified decode forward call
- PR: #16076
#0: Enable Local Sweeps and Use a Faster Interprocess Queue
- PR: #16098
#15601: Implement support for MeshDevice::reshape(..)
- PR: #16029
Remove setup_core_to_tlb_map
- PR: #16048
#0: Let sharded_to_interleaved handle interleaved input
- PR: #16116
#0: separate validation of conv weight and bias.
- PR: #15990
#0: Minor refactor of pytensor and tensor implementation files
- PR: #16108
C++ files should not be part of the API of a library
- PR: #16123
#15857: Forge sweep test
- PR: #15858
#15857: Unary forge sweep tests
- PR: #15901
Fix some more namespace pollution caused by using namespace tt::tt_metal
- PR: #16090
#15713 Bad Eltwise Binary ZEROACC
- PR: #16094
#15565 Fix unit test to show sharding ttnn.from_torch problems
- PR: #16088
Fix paged SDPA decode CB sizing issue
- PR: #16059
Reland async dispatch with workaround for hang.
- PR: #16121
#16119: Add forge traces to matmul and reduce sweeps
- PR: #16139
#10034: Binary shift operators
- PR: #16055
#0: Remove incorrect memory span assert
- PR: #16136
Add forge sweeps for slice and transpose
- PR: #16112
#0: Move memory config serialization in the corresponding header away from types.hpp
- PR: #16151
#16114: Allow Binarized Programs to be Reused across WH Devices
- PR: #16120
#0: aligning conv2d transpose as conv
- PR: #16128
support missing cases for sweep tests
- PR: #15804
#0: added normalization details in the tech report
- PR: #15124
Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #15882
Port all Moreh OPs to compute_output_specs
- PR: #16160
Bump umd to fix grayskull cluster bug
- PR: #16126
Clean-up the usage of deallocate_activation
- PR: #16099
llm tech report multi device section
- PR: #16180
Add prefill v decode section to LLM tech report [section 3.2]
- PR: #15096
#0: Update eltwise binary to support sharding on arbitrary cores on an arbitrary sub-device grid
- PR: #16024
[LLM tech report] Add accuracy evaluation and debugging sections
- PR: #15190
#16165: Disabling test that depends on some machine state to pass
- PR: #16166
enable dps ops for matmul
- PR: #15285
Isolate tracy
- PR: #16161
[TT-Train] Added tests for sum and mean
- PR: #16152
#16184: Try using ecr to avoid rate limits of docker.io
- PR: #16201
#15221: Post completion messages to dispatch_s
- PR: #16187
[TT-Train] Added softmax backward
- PR: #16168
Optimized FreeList allocator
- PR: #15536
Set the test data to be relative to the test binary
- PR: #16150
#0: Fix matmul doc string
- PR: #16208
#0: remove spammy warning from conftest
- PR: #16198
Update generating unicast go signal commands to ensure dispatch write linear respects alignment
- PR: #16117
LLM tech report sections 2.2, 2.5
- PR: #15121
[TT-Train] Fix tracy deps in the tt-train cmake
- PR: #16209
Updating Allocator docs to explain first fit usage
- PR: #16214
Adding asserts for hanging cases in ND tilize/untilize support
- PR: #16170
Fix ttnn.reallocate when unaligned RM tensors are used
- PR: #16192
#15891: improve full accuracy and fix full bugs
- PR: #16182
Revert "Fix ttnn.from_torch for 0D/1D tensors with tile layout (#15882)"
- PR: #16222
#15857: Skip abs forge for GS
- PR: #16221
#16213: Use our own forked Docker Run Action that points to ECR
- PR: #16219
Add max kernel size for each risc type in an op
- PR: #16203
Infer Conv2dTranspose parameters during model preprocessing
- PR: #16028
#12662: add keepdim fixes to reduce
- PR: #16163
Add chunked prefill to Llama family
- PR: #16111
#15342: Add mirror_kernels option to conv_transpose2d
- PR: #15995
Update CODEOWNERS
- PR: #16196
support reduction for 3d & 4d dims
- PR: #16236
#5605: Only force-stall ethernet programs on earlier ethernet programs
- PR: #16202
Add full support for creating tensors with logical sharding from python
- PR: #16072
update llama 3.1 70b v0 tt-metal and vllm commit refs in docs
- PR: #16246
#15857: Binary Forge Sweep Tests Set2
- PR: #16087
#14976/#15039: Add Support For ceil_mode=True
- PR: #16124
Add missing cache invalidates + loads before stores noc optimization for BH
- PR: #16037
Initial CCL Rewrite Push (Unblocks Parallelization of Efforts and Some TG Llama integration)
- PR: #16026
New FD Init Flow
- PR: #15406
Add support for output sharded embeddings
- PR: #16237
Revert "#5605: Only force-stall ethernet programs on earlier ethernet programs"
- PR: #16257
#0: Enforce tile layout when using bf4/bf8 data types
- PR: #16199
MeshDevice: Support Quanta Galaxy system file
- PR: #16239
Move Device members from public to private
- PR: #16256
Add unary sharded sweeps
- PR: #15300
#0: Added core_grid offset for sharded layernorm
- PR: #16207
fix abs path bug for sweeps tests code
- PR: #16285
#0: Publish TT-Distributed doc under tech_reports
- PR: #16261
#15061: Extended {to,from}_vector to support tilized layout, bf4/8 formats
- PR: #16105
#16265: Remove creation op
- PR: #16269
Fix unsigned arithmetic bugs in reshape ops
- PR: #16253
Fix compile issue for earlier c++ versions
- PR: #16291
#0: Typo fix in TT distributed tech report
- PR: #16308
[Llama3-text vLLM integration] Modify Llama3 text model (new and old codebase) forward apis for vLLM compatibility
- PR: #16292
LLM tech report sections 3.1, 3.4, 3.5
- PR: #15110
LLM Tech report section 4.4
- PR: #15166
Move some Device methods to private section
- PR: #16259
#0: [skip_ci] Update Distributed Tech Report with Discord Server link
- PR: #16314
#15857: Binary Forge Sweep Tests Set1
- PR: #16042
#0: Fix get_dispatch_core_config in conftest.py to not modify the device_params to not affect subsequent tests
- PR: #16290
#0: Remove hardcoded grid width in all_gather and skip test_sharded_matmul test when the device grid size is too small
- PR: #16315
