Releases: tenstorrent/tt-metal

v0.68.0-dev20260222

22 Feb 03:51
Immutable release. Only release title and notes can be modified.
ee7b53f

v0.68.0-dev20260222 Pre-release

Note

If you are installing from a release, refer to the README, INSTALLATION instructions, and any other documentation packaged with that release, not the versions on the main branch. The latest main may differ from the release.

The changelog below lists the changes since the last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/22267362568

📦 Uncategorized

  • Fuse sdpa_reduce_to_all with post_sdpa
  • Decouple FDKernels from MetalContext
  • D2D Socket based Python Op
  • DeepSeek teacher-forced accuracy: defer xfail until after accuracy metrics are computed
  • Fix uninitialized KernelHandle in LayerNormForwardKernels
  • Fix uninitialized KernelHandle in SDPA backward program factories
  • Fix dead store in matmul DRAM sharded program factory
  • Fix watcher assert in blitz broadcast
  • Fix core.NullDereference in rotate operation program factories
  • #36020: Add MOE/MLP weights infra and update tests
  • Add position tracking to mla

v0.68.0-dev20260221

21 Feb 10:16
Immutable release. Only release title and notes can be modified.
aa7c4d6

v0.68.0-dev20260221 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/22246649996

📦 Uncategorized

  • Create view MeshBuffer in tensor::view and store root MeshBuffer in DeviceStorage
  • Move cb ids to CTAs for blitz RMSNorm
  • Adding trace and 2cq functionality to pi0 model
  • Remove static variables from dispatch topology init code
  • A balanced traffic pattern for AG minimal.
  • #37464: Update unary LLK Tile API's
  • fix deepseek quad CI
  • [TTTv2] Add LMHead1D module for 1D topologies
  • Fix TTTv2 Galaxy CI: Set default HF_MODEL and simplify MLP2D test memory config
  • Optimise LCM for BH.
  • #37985: Overlap post-SDPA CB memory regions
  • Increase stable diffusion demo test timeout to 600s
  • SDPA decode bug fixes
  • Add MLA SDPA test to tests/didt
  • chore: update LLK submodule to e9428f4
  • Add new didt tests for the SDPA OP
  • Disable SD 1.4 on BH P150 Model perf pipeline
  • Revert "A balanced traffic pattern for AG minimal. (#37878)"
  • Add sub_core_grids to ttnn.pad
  • Disabling n300 and enabling more verbose output in triage tests
  • Non-causal SDPA data movement improvements
  • Fix remaining broken imports after SDPA test migration (#37713)
  • [QSR] Adding missing pieces to run compute kernels
  • Add socket pipeline rate tests
  • [DM] Fix hang in DM Test suite
  • [skip ci] temp skip sentence bert due to #38178
  • Add multicast write noc util to perf report CSV
  • Dumping all debug bus signals for block if any risc inside is broken
  • Update tt-triage instructions in kernel debugging tips
  • [skip ci] Add AI tool restrictions for bug bounty program
  • Revert "Dumping all debug bus signals for block if any risc inside is broken"
  • increase timeouts for longer deepseek tests
  • Add tiered model CI pipelines for multi-SKU unit and e2e testing
  • PDL perf bump
  • SDXL Relax unet PCC thresholds
  • Adding extra-tag to allow override
  • Cleanup sigmoid implementation
  • DeepSeek MOE/MLP fusion with reduce_to_one
  • rebase/update fabric ubench golden results
  • Fix multi-process safety issue with jit_link_additional_processor
  • Set umd-admins as owners of UMD submodule
  • Use consistent hash for JIT build cache paths
  • Switch broadcast noc usage
  • Revert "Use_VC propagation fix version 2 (#36529)"
  • TTNN Tensor Cleanup in preparation of Metal Tensor Split
  • MoE: Selective reduce combine

v0.67.0-rc1

20 Feb 19:59
Immutable release. Only release title and notes can be modified.

v0.67.0-rc1 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/22220491796

📦 Uncategorized

  • Improve custom_mm to performantly cover more shapes and enable transpose
  • changed reshape tensor layout to TILE for deepseek moe_gate
  • MLA Optimizations
  • Add precompiled headers to tt-train for faster compilation
  • Adding uneven output shard support to untilize
  • Fix minor typos in unary max/min comments.
  • Move prefetcher pytest option to avoid breaking CI tests
  • [Gemma3] Fix for gemma3 failing unit tests
  • [GPT-OSS] Add fused op unit tests for MoE
  • Disable stable_diffusion model perf test on blackhole (#37617)
  • Add program configs for Matmul ops in Embedding block to run across 40 cores in the SDXL Refiner
  • [tt-train] Add training log comparison plotting script
  • [skip ci] Enable watcher apc nightly debug
  • Adding test harness to check cache on device compatibility for Deepseek 671B
  • [Watcher] tt-train-cpp-unit tests have new watcher enabled fails due to recent changes
  • chore: update LLK submodule to 346a830
  • removes meta lib dependencies
  • [WATCHER] Following issues are detected when watcher is enabled on BH post commit
  • [skip ci] Add P300-viommu to BHPC multi card fast tests
  • SGLang generator
  • [tt-train] Complete nanoGPT Python impl
  • Add new CI pipeline for Deepseek to test long seq lens and refactor tests
  • Topology Mapper Integration with Topology Solver API
  • Make TP All reduce optional in Post SDPA
  • Fix misleading comment in dataflow_api for multicasts
  • [skip ci] Update llama demo upstream test id's
  • Enable multi-host neighbor-pad and RingAttentionAllGather CCLs
  • LLK API support for 8x32 tilize
  • Upgrade Pillow -> 12.1.1 to fix CVE-2026-25990
  • Fix moreh kernel runtime arg bounds issues (#37193, #37040)
  • Convert Sparse Multicast Static Asserts to Runtime Asserts
  • Do not use internal bh name in builtins
  • Quasar compute API bringup V1.0
  • [Deepseek Blitz] Split q a proj mm on inner dim
  • Reduce to one generic op and fusing it with moe routed expert
  • [TTTv2] Add attention_1d module with comprehensive unit tests
  • Matmul - Add Support for 2D DRAM interleaved in0 + batched height sharded in1
  • Changes for quad module tests CI
  • Subtract grid offset when computing 0-based indices in sharded LN factory
  • Decouple Cluster initialization from HAL
  • Switch llama 8b to DP=4 in vllm nightly
  • A balanced traffic pattern for AG minimal.
  • [skip ci] Remove t3k select pipeline extra-tag inputs
  • #36982: create_q_heads tilizes to 8x32 tiles
  • Enable (very) basic compute kernels
  • Migrate conv operations to free function style
  • Migrate fast dispatch frequent tests to CIv2 runners
  • reduction: migrate to free function binding + generic cleanup
  • Use gh_run_number for Superset dashboard links in Slack notifications
  • Fix race condition in parallel multi-source jit build
  • chore: update LLK submodule to f7cf929
  • Move SDPA and MLA tests from tt_eager/misc to ttnn/operations/sdpa
  • Revert "A balanced traffic pattern for AG minimal. (#36607)"
  • [skip ci] Fix galaxy perf tests yaml (bad merge)
  • [DM] Update data movement multi_interleaved tests
  • SDXL clip encoder perf targets updated
  • Fix timeouts in vllm nightly
  • DeepSeek Blitz moe fusion
  • Generate Welford reciprocals in Python and pass into distributed layernorm ops
  • Fix TTTv2 MLP 1d from model args mismatch + BH Stress test pytest id
  • [skip ci] Fix Package and release workflow
  • Update compute kernel API to reflect new changes to fast tilize
  • Fix timeouts for qwen in vllm nightly
  • [skip ci] Add back missing schedule to BH demos
  • Pool2D Alignment Fixes for Watcher
  • Add LLK_ASSERTs for verifying tile index in dest accumulator
  • Make mm respect first core from subdevice
  • Add TTTv2 rmsnorm module unit tests to T3K e2e pipeline
  • Unify kernel and firmware JIT build deduplication into JitBuildCache
  • fix(sweep): correct lead-models Slack notifier's run context, counts, and alerting
  • Propagating new unpack LLK for reduce ops
  • #37471: Output dtype parameter - fix for fp32 dst mode conflict
  • Add indexes to TTNN report db
  • DeepSeek Blitz MLP fusion
  • [skip ci] Move conv test to run last in upstream didt suite
  • Delete Event as it is unused code
  • Kwerblinski tt/37656 blitz lm head
  • fix processor names in watcher tests
  • Migrate experimental operations to use bind_function template and free functions
  • Reorder device params to fix deepseek tests cache paths
  • Split initialization of various components into their own classes
  • Add CQ_PREFETCH_CMD_RELAY_LINEAR_PACKED_H command
  • H<->D Ops for Blitz + Changes to support Async Slow Dispatch
  • Migrate pool and adaptive pool operations to free function style
  • Halo Check Output Grid Matches Input Grid
  • Expose tile dim reconfig template flag in metal
  • TT-triage device and core hardening
  • Improve venv relocatability for distributed and tt-run env inherit
  • #37896: Fix silu_init for BH
  • Fix broken import in test_deepseek_mla_ops.py after SDPA test migration
  • Add tt_symbiote: PyTorch-to-TTNN transparent acceleration framework
  • [Blitz Decode] Integrate Embedding with H2D
  • #0: Fix noc_async_write_multicast to pass noc when using one packet version
  • Full flash mla for blitz
  • Implement FMOD as LLK op
  • [gpt-oss] batched prefill and prefill tracing
  • [WATCHER]: Fix reader runtime args for idle cores in SDPA decode
  • Fix deepseek test_moe device_params ordering for cache paths
  • [UMD Bump] Automated UMD Bump 09.02.2026
  • Reduce DeepSeek long-seq decoder override to 12288
  • DeepSeekV3 teacher forcing: KV cache + improved refpt generation
  • fix galaxy quick tests
  • latency packet index ack move to back
  • Add fused minimal matmul addcmul operation
  • Update micro op kernels to not use full inits, and use reconfigures + short inits
  • Updates trace region size for Qwen3-32B on Galaxy to avoid running out of memory
  • Add Fabric multi-host test on ExaBox BH Quad
  • Improve model tracer infra
  • [gpt-oss] fix b=1 demo
  • Fix yaml path reading in nano_gpt and mesh shape in autograd
  • ci(sweeps): restrict lead-model slack notifications to scheduled main runs
  • jit: remove redundant unpack bfp format conversion
  • Fix buffer not sharded error in ring matmul 1d unit tests
  • Deepseek: Optimized OP for MoE Gate
  • SDPA reduce to all positional logic
  • Fused rmsnorm allow fp32 stat and rope inputs
  • Fabric Fused Scatter Write + Atomic Increment Messaging
  • Add deepseek decode layer test into galaxy-quick
  • Add DeepSeek V3 B1 demo host interface integration tests
  • [tt_dit] Reduce module cache data size
  • Pipe compute config to reduce scatter
  • Adding ND Sharding Support for the Untilize With Unpadding Op
  • [skip ci] Run test_host_io.py on viommu runners only
  • Add argmax based k=1 sampling micro-op to be used in the fused LM head + sampling layer
  • added demo profiling script and device perf utils
  • [skip ci] Rename workflow and update repository references
  • Increased core count for paged SDPA for Qwen
  • Add GitHub merge queue data workflow
  • Updating slice_write tests to use the ttnn.experimental module
  • use nested skus for deepseek perf test [skip ci]
  • Simplified the way to select a program factory
  • Consolidate JIT-generated descriptor headers
  • add wraparound for neighbor exchange
  • [QSR] Enable all Neos
  • Remove hostname suffix for TT_METAL_CACHE in ttrun.py
  • Revert "[skip ci] Remove parallelism as we suspect a race condition somewhere"
  • Added support for per-batch sampling params for Whisper
    • PR: #3...

v0.67.0-dev20260220

20 Feb 14:55
Immutable release. Only release title and notes can be modified.
d5bfb8d

v0.67.0-dev20260220 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/22206237336

📦 Uncategorized

  • Add #ifdef guards to chlkc_descriptors.h
  • Use named CBs in matmul factories and kernels
  • Use regex in watcher assert test to avoid hard-coded line number
  • [QSR]: Fix missing comma in ncrisc_noc_fast_read template
  • [tt_dit] Fix reciprocal tensor reuse in DistributedLayerNorm
  • #37982: Overlap pre-SDPA memory regions
  • [apc break] Revert "Add check for proper configuration of unpacker and packer during init and block (#37265)"
  • [tt-train] softmax_backward kernel implementation
  • [skip ci] Remove dead/unused code from the repo
  • ci: update condition for AI assistant job execution
  • Fixing triage tests on p150b
  • [skip ci] DeepSeek prefill directory
  • Remove unnecessary flushed barrier between data and semaphore multicast in conv ops
  • Revert "[tt-train] softmax_backward kernel implementation (#31580)"
  • [tt-triage] Increase console width for better output formatting
  • Fix forward_prefill calls in Galaxy MLP prefill tests
  • Fix tools test: update watcher assert string to match uppercase BRISC
  • Added tensor dimensional stability to moe prefill gating on Mixtral
  • SDPA Decode Optimization: Tree Reduce
  • Re-enable SD 1.4 on Model perf BH pipeline
  • [umd] Use semver_t::from_wormhole_eth_firmware_tag
  • [build]: Fix #37904 — build_metal.sh fails on Fedora/RHEL
  • [TT-Train] Fix GCC build: qualify self-referential using declarations (#37922)
  • Removed previously used llk_unpack_AB_reduce_init
  • Optimize DeepSeekV3 weight dequantization
  • [skip ci] #0: add two nightly subdirectories to CODEOWNERS
  • Align Wan pipeline to reference
  • Multi-mesh Topology Mapping Utility
  • [Quasar DFB] Update to support running on Tensix and update tile counter assignment to respect remapper rules
  • #38022: add ttnn reduction tests to l2 nightly
  • Add a function to completely tear down metal
  • Allow writing to sharded memory from pinned memory
  • Add DeepSeekV3 B1 demo CLI script
  • Add watcher stack usage support for Quasar
  • Allow usage of freed row/col with slow dispatch
  • Cleanup of untilize and untilize_with_padding nd-sharded reader kernel and factories
  • VADv2 bug fix (ttnn.repeat crashing)
  • Fuse reduce to one with D2H
  • #36020: Add infra and tests for overlapping blitz decode weights

v0.67.0-dev20260219

20 Feb 00:59
Immutable release. Only release title and notes can be modified.
2198847

v0.67.0-dev20260219 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/22163635166

📦 Uncategorized

  • add wraparound for neighbor exchange
  • [QSR] Enable all Neos
  • Remove hostname suffix for TT_METAL_CACHE in ttrun.py
  • Revert "[skip ci] Remove parallelism as we suspect a race condition somewhere"
  • Added support for per-batch sampling params for Whisper
  • [skip ci] Add exit logic to analyze_validation_results.py to support automation
  • Fixing triage tests on blackhole
  • Fix race in Blitz Flash MLA
  • #37414: Prefill optimised MLA op.
  • disable padding[0] check for conv3d, add test config
  • #37716: Fix block-sharded conv2d producing wrong results with dilation > 1
  • Enabling blackhole triage CI
  • Revert "Enabling blackhole triage CI (#38005)"
  • Improve precision, range, and performance of sin/cos/tan.
  • Clean-up topk and topk sweep tests
  • [GPT-OSS] Experts matmul changes
  • Fix SDPA TT_METAL_WATCHER issues
  • [tt-triage] Add aggregated callstacks script
  • [Quasar] Fix Quasar build: Add return statements
  • Internalize DeepSeek MOE/MLP op looping
  • [skip ci] Add BH WH differential tags to the workflows
  • Use_VC propagation fix version 2
  • Add bfloat8 kv cache update
  • #29206 certain model comparison mode failed for bcast op golden function
  • Bump ttsim version to v1.3.5
  • Fix TensixTestL1ToPCIeAt16BAlignedAddress race condition
  • Fix CB wrapping blocking writer test hang (wrong TRISC core)
  • Add Deepseek 16x32 fast tilize test
  • Fix docker image ubuntu python versions
  • Add check for proper configuration of unpacker and packer during init and block
  • Move decode warmup from vLLM to metal side
  • Fix dynamic noc mode support for blitz mcast
  • 38015: move/fix/remove some eager tests
  • Created matmul lab 3 for universities

v0.66.0-rc16

19 Feb 15:54
Immutable release. Only release title and notes can be modified.

v0.66.0-rc16 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/22163672659

  • no changes

v0.67.0-dev20260218

18 Feb 08:19
Immutable release. Only release title and notes can be modified.
e25b1e7

v0.67.0-dev20260218 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/22121609614

📦 Uncategorized

  • Updates trace region size for Qwen3-32B on Galaxy to avoid running out of memory
  • Add Fabric multi-host test on ExaBox BH Quad
  • Improve model tracer infra
  • [gpt-oss] fix b=1 demo
  • Fix yaml path reading in nano_gpt and mesh shape in autograd
  • ci(sweeps): restrict lead-model slack notifications to scheduled main runs
  • jit: remove redundant unpack bfp format conversion
  • Fix buffer not sharded error in ring matmul 1d unit tests
  • Deepseek: Optimized OP for MoE Gate
  • SDPA reduce to all positional logic
  • Fused rmsnorm allow fp32 stat and rope inputs
  • Fabric Fused Scatter Write + Atomic Increment Messaging
  • Add deepseek decode layer test into galaxy-quick
  • Add DeepSeek V3 B1 demo host interface integration tests
  • [tt_dit] Reduce module cache data size
  • Pipe compute config to reduce scatter
  • Adding ND Sharding Support for the Untilize With Unpadding Op
  • [skip ci] Run test_host_io.py on viommu runners only
  • Add argmax based k=1 sampling micro-op to be used in the fused LM head + sampling layer
  • added demo profiling script and device perf utils
  • [skip ci] Rename workflow and update repository references
  • Increased core count for paged SDPA for Qwen
  • Add GitHub merge queue data workflow
  • Updating slice_write tests to use the ttnn.experimental module
  • use nested skus for deepseek perf test [skip ci]
  • Simplified the way to select a program factory
  • Consolidate JIT-generated descriptor headers

v0.67.0-dev20260217

17 Feb 03:40
Immutable release. Only release title and notes can be modified.
3d4d450

v0.67.0-dev20260217 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/22081829437

📦 Uncategorized

  • [Blitz Decode] Integrate Embedding with H2D
  • #0: Fix noc_async_write_multicast to pass noc when using one packet version
  • Full flash mla for blitz
  • Implement FMOD as LLK op
  • [gpt-oss] batched prefill and prefill tracing
  • [WATCHER]: Fix reader runtime args for idle cores in SDPA decode
  • Fix deepseek test_moe device_params ordering for cache paths
  • [UMD Bump] Automated UMD Bump 09.02.2026
  • Reduce DeepSeek long-seq decoder override to 12288
  • DeepSeekV3 teacher forcing: KV cache + improved refpt generation
  • fix galaxy quick tests
  • latency packet index ack move to back
  • Add fused minimal matmul addcmul operation
  • Update micro op kernels to not use full inits, and use reconfigures + short inits

v0.67.0-dev20260216

16 Feb 03:28
Immutable release. Only release title and notes can be modified.
ecc3ca4

v0.67.0-dev20260216 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/22046210687

📦 Uncategorized

  • Fix broken import in test_deepseek_mla_ops.py after SDPA test migration
  • Add tt_symbiote: PyTorch-to-TTNN transparent acceleration framework

v0.67.0-dev20260215

15 Feb 03:26
Immutable release. Only release title and notes can be modified.
53f7c88

v0.67.0-dev20260215 Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/22026945186

📦 Uncategorized

  • fix(sweep): correct lead-models Slack notifier's run context, counts, and alerting
  • Propagating new unpack LLK for reduce ops
  • #37471: Output dtype parameter - fix for fp32 dst mode conflict
  • Add indexes to TTNN report db
  • DeepSeek Blitz MLP fusion
  • [skip ci] Move conv test to run last in upstream didt suite
  • Delete Event as it is unused code
  • Kwerblinski tt/37656 blitz lm head
  • fix processor names in watcher tests
  • Migrate experimental operations to use bind_function template and free functions
  • Reorder device params to fix deepseek tests cache paths
  • Split initialization of various components into their own classes
  • Add CQ_PREFETCH_CMD_RELAY_LINEAR_PACKED_H command
  • H<->D Ops for Blitz + Changes to support Async Slow Dispatch
  • Migrate pool and adaptive pool operations to free function style
  • Halo Check Output Grid Matches Input Grid
  • Expose tile dim reconfig template flag in metal
  • TT-triage device and core hardening
  • Improve venv relocatability for distributed and tt-run env inherit
  • #37896: Fix silu_init for BH