
v0.56.0-rc4

Pre-release
github-actions released this on 01 Feb 02:07 · commit 70206b9

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, rather than the documentation on the main branch. There may be differences between the latest main and the most recent release.

The changelog follows, showing the changes since the last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/13083742254

📦 Uncategorized

  • #0: Add a way to specify custom dispatch topology
  • Initialize work_executor_ and set WorkExecutorMode::SYNCHRONOUS in MeshDevice constructor
  • [UMD] Switching to new coord API
  • #0: Increase create heads test coverage for Llama shapes
  • #16503: Optimize semaphore and CB writes
  • Quiet down the CMake output from dependencies
  • #16847: update to address the unaligned noc_async_copy from DRAM to L1
  • remove ND-failing big shape from transpose failures; it is already tracked and disabled in test_transpose_2d
  • #14898: pass in pad value to transpose in reduce
  • rm -rf build.yaml
  • #16982: Fixing program cache issues with reshape
  • Delete convd_host_weights and update all tests using conv2d
  • Update CODEOWNERS for the public API
  • Aliu/bug fix
  • #0: Move distributed headers into the public API directory
  • [Llama3.2-11b-vision] Add support for text-only inference through generator api
  • Remove references to ARCH_NAME in programming example
  • Enable PR Gate
  • #16806: Fixed watcher assert on reshape in debug mode
  • #0: Fix doc links and make them point to the new location
  • Remove ARCH_NAME references in prog_examples
  • Prefer MOLD over LLD over LD
  • Restore build-wrapper.yaml with updated method
  • [TT-Train] Fix text generation
  • LightMetal - Add Flatbuffers into cmake infra/build as cpm package (#17039)
  • Format broken Kernel APIs Tables
  • #0: Use MeshBuffer to store MeshWorkload kernel binaries
  • Increase rms_norm and layernorm coverage for Llama shapes
  • #17213: update fused and matmul trace sweep tests
  • Add support for reading from / writing to partial buffer regions that are page size aligned for sharded buffers
  • #0: Fix clang-format for dataflow_api.h
  • Kkabilar tt single card perf
  • [FABRIC] ASYNC_WR_ATOMIC_INC
  • #9945: Enable and fix SD device perf test
  • Check context switch pointer for eth cores before resetting
  • Pull llrt.hpp out of public interface
  • Update perf and latest features for llm models (Jan 27)
  • #0: (MINOR) Bump to generate RCs for v0.57.0
  • Remove dead includes of host_api.hpp from ttnn
  • Prevent UNet Shallow perf report entry from being overwritten
  • Fix setup.py for Anaconda
  • Do not run PR Gate on Draft PRs
  • Add a timeout for docker image building
  • LightMetal - New APIs LightMetalBeginCapture() and LightMetalEndCapture() and docs (#17039)
  • #0: Update distributed tests build to account for arch
  • #17227: Make dispatch core order match for single chip 2 CQ and multi-chip 2 CQ topologies
  • #17215: Add explicit dealloc for mesh buffer
  • #0: Add validation test for dispatched remote circular buffer config to device
  • Remove get_completion_queue_reader_core() API from Device
  • Add resharding to post all gather layernorm/ rms norm op
  • #0: Fix ttnn shared libs build
  • #0: Schedule runs for single card new models tests
  • Implement JointAttention
  • Revert "Add resharding to post all gather layernorm/ rms norm op (#17156)
  • Update memory config when using view op with height sharded tensors
  • #16812: Reordering cbs in reduce_init_delta
  • #17083: Add support for watcher printing phys coords
  • #16945: Add auto retries to post commit on branches
  • Remove CommandQueue redirecting usages straight to HWCQ
  • 1D support for tilize/reshape ops
  • #16138: W-broadcasting for sharded tensors
  • #0: Add PR Gate to data pipeline
  • #15174: Re-enable mistral7b demo test after fw upgrade
  • LightMetal - Add LoadTrace() API and move TraceDescriptor out of detail namespace (#17039)
  • #15974: Create device tensors table in report database
  • Privatize dprint_server.hpp
  • Uplift Allocator to be its own class + migrate calls to Allocator APIs
  • Bump CMake in the Docker image
  • Add perf reporting for ccl async mode
  • Fix debug checks for bank assignments when initializing the allocator
  • Add perf report for reduce scatter async
  • #0: expand halo documentation and fix images
  • Fix shard and physical height mismatch in ttnn.convert_to_chw tests
  • #17134: Remove unused components
  • #0: Fix retry comparison which causes endless retries until pass
  • Make runtime_args_ptr const ref to solve clang-tidy error due to 58f9654
  • Use padded shape to construct output memory config in ttnn.view
  • #17322 Remove transpose cpp unit tests
  • #17215: Initial MeshBuffer integration with TTNN
  • fix sub-height height sharded WH transpose by setting output memory config to width sharded
  • #0: Only produce cicd data on workflow runs that are success/failure/cancelled (ignore skipped runs)
  • #0: Use posted writes for profiler.
  • #16149: Add DeviceCommandCalculator to calculate command size
  • #15889: Fix handling of mantissa rounding to respect ties round to even
  • #17312: Fix type error when saving report config to json file
  • Implement ttnn.sampling op for top-k, top-p sampling (see the usage sketch after this list)
  • Test for a bad state before building
  • #0: Fix incorrect assertion introduced in #17259
  • #0: Add missing include for work_split.hpp
  • Replace ttnn::Shape/LegacyShape with SimpleShape in Python
  • #0: Correcting bad dim check in CCL tests
  • [tt-train] Add scatter workaround while proper version is in development
  • [skip-ci] Bump timeout
  • #15414: Read annotation data to determine job-level failure signature and reason
  • #0: TT-Mesh bug fix MeshCQ/MeshWorkload on device indexing
  • #17374: Add concurrency group for _produce-data.yaml
  • Revert "#17374: Add concurrency group for _produce-data.yaml"
  • #0: Fix broken link for "Programming Mesh of Devices" tech report
  • Revert "#15414: Read annotation data to determine job-level failure signature and reason"
  • Debug strange error
  • Extract CommandQueue interface
  • Revert "[skip-ci] Bump timeout (#17397)"
  • Move dispatch_s queries out of the device interface to a dedicated dispatch query layer
  • disable-pr-gate
  • Remove deprecated get_shape() from Moreh ops
  • #17059: Update golden functions
  • Ucheema/tt fabric async wr api
  • #0: separate dispatch constants
  • #0: Add programming examples for TT-Metalium multi-device native APIs
  • Expanded im2col documentation
  • Removing skip_for_blackhole and fix some minor issues for blackhole conv2d
  • Delete tilize cpp tests
  • Page size assertion error
  • #0: Fuse batch slicing with create_qkv_heads
  • Post all gather layernorm/rms norm with resharding
  • Xuncai/all gather fixes
  • #15414: Read annotation data to determine job-level failure signature and reason
  • #0: Change noc_inline_dw_write to use write_at_cmd_buf
  • #0: Update pgm_dispatch_golden.json
  • #16380: add nD support for generic reductions
  • Address clang tidy error
  • Add Qwen2.5 vLLM generator (based on LlamaGenerator), fix batch 1 issue with generator's decode forward
  • Step 1 of normalizing Boost as a dependency
  • Add DeepSeek perf and instructions
  • [Llama3.2-11b-vision] Add max_cross_attn_tokens property to vLLM generator class
  • #0: Add failure signature for runner shutdown
  • #0: fix bug in createqkv
  • #0: Modify DeviceRange semantics to match CoreRange
  • Remove ttnn::Shape and LegacyShape from the codebase
  • Update the .rst to match code
  • Rename ttnn::SimpleShape to ttnn::Shape
  • Update vLLM commit for deepseek-distill-llama70b in README.md
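
For the new ttnn.sampling op listed above, the sketch below shows how a top-k/top-p call might look. The op name and its purpose come from this changelog entry; the parameter names `k` and `p` and the exact call signature are assumptions, so consult the ttnn API documentation for the released interface.

```python
# Hypothetical usage of the new ttnn.sampling op (top-k / top-p sampling).
# The op name comes from this release's changelog; the `k` and `p` parameter
# names and the exact signature are assumptions, not the confirmed API.
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Logits for a batch of 1 over a small vocabulary.
logits = torch.randn(1, 32)
logits_tt = ttnn.from_torch(
    logits, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device
)

# Keep the 10 highest-probability tokens (top-k), then sample from the
# smallest set whose cumulative probability exceeds 0.9 (top-p).
token_tt = ttnn.sampling(logits_tt, k=10, p=0.9)  # assumed signature

print(ttnn.to_torch(token_tt))
ttnn.close_device(device)
```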