What's Changed
- Add PyTorch Conference Europe 2026 events by @choijon5 in #1812
- Fix LFBO surrogate training data duplication by @fulvius31 in #1806
- Move imports to top of file by @jansel in #1804
- [cutedsl] Generalize shape-chain lowering and fix nested matmul argreduce by @jansel in #1787
- Fix docs build failure by @jansel in #1814
- Capping the number of warps to avoid non-performing configs and reducing the autotuning time. by @umechand-amd in #1674
- Add pytorch-probot config to enable retryBot for Helion by @huydhn in #1808
- Update torch_tpu pin by @oulgen in #1816
- avoid nvshmem symm-mem backend by @shunting314 in #1750
- increase signal pad size for dist matmul kernels by @shunting314 in #1753
- move distributed runtime utils out of examples/ folder by @shunting314 in #1771
- Increasing block size dimensions to avoid configs which are slow and poor candidates. by @umechand-amd in #1677
- Cleanup the tests skips on ROCM. by @umechand-amd in #1820
- Fix Pallas buffer donation: only donate in-place mutated tensors by @thcmbs in #1802
- Gracefully handle the error, "failed to translate module to LLVM IR" by @umechand-amd in #1793
- fix to prevent incorrect output in test_print by @umechand-amd in #1821
- HELION_FORCE_AUTOTUNE: skip cache read but write back result by @fulvius31 in #1815
- Support block_ptr/TensorDescriptor with extra_mask for loads by @hinriksnaer in #1768
- [autotuner] Add benchmark_batch as unified entry point for autotuner benchmarking by @hinriksnaer in #1810
- restrict to persistent pid for dist kernels by @shunting314 in #1772
- remove hl.signal/wait by @shunting314 in #1791
- cleanup the autotuner factory function by @shunting314 in #1770
- properly codegen hl.triton_kernel by @shunting314 in #1797
- move compile time dist APIs to _dist_utils.py by @shunting314 in #1799
- Use a larger timeout setting to autotune distributed kernels by @shunting314 in #1800
- [autotuner] removed unused duplicate function check_config_consistancy by @hinriksnaer in #1829
- chore: Bump actions/deploy-pages from 4 to 5 by @dependabot[bot] in #1826
- [pallas-tpu] Add tests for previously untested examples by @v0i0 in #1665
- [cutedsl] Enable more tests by @jansel in #1822
- Unskipping tests on ROCM by @umechand-amd in #1831
- [Helion + torch.compile] Fix dynamic shapes in HOP path and enable test_dynamic_shapes_basic by @yf225 in #1836
- fix regression caused by large block size in dist autotuner by @shunting314 in #1834
- Revert "Add scheduled workflow to rerun GPU health check failures (#1683)" by @huydhn in #1833
- rendezvous only once for matmul reduce scatter by @shunting314 in #1824
- Refactor: use backend *_expr methods in PointerIndexingStrategy by @aditvenk in #1850
- Fix #1081: Skip failed configs in parallel benchmark by @tianrengao in #1673
- Fix pre-commit pyrefly on macOS by sharing ignore-missing-imports logic by @aditvenk in #1858
- Revert "Capping the number of warps to avoid non-performing configs and reducing the autotuning time. (#1674)" by @jansel in #1860
- [autotuning] Refactor PrecompileFuture construction out of BaseSearch by @hinriksnaer in #1817
- [cutedsl] Fix launch axis mapping and serialized reductions by @jansel in #1840
- [Helion + torch.compile] Always use HOP path, gate fusion with torch_compile_fusion kernel setting by @yf225 in #1837
- [cutedsl] Fix CuTe shape/view lowering and loop-carried matmul accumulation by @jansel in #1856
- [Pallas] Fix block_spec_info grid/loop_order mismatch by @norx1991 in #1788
- add script to make tpu wheels by @v0i0 in #1818
- chore: Bump actions/configure-pages from 5 to 6 by @dependabot[bot] in #1882
- Fix Pallas BMM block size alignment via new dim_matches method by @thcmbs in #1828
- Remove FROM_DEFAULT, unify initial population on FROM_BEST_AVAILABLE with pad control by @fulvius31 in #1809
- [Helion + torch.compile] Reject static_shapes=True with dynamic=True to prevent wrong results by @yf225 in #1864
- [Helion + torch.compile] Add bench_compile_config protocol for autotuning by @yf225 in #1865
- Add pallas_attention kernel and default-loop-type test by @v0i0 in #1866
- Skip Triton-specific batch dim constraint for Pallas matmul by @v0i0 in #1867
- [autotune] Migrate precompile future to a dedicated file location by @hinriksnaer in #1872
- [Pallas] Enable test_use_block_size_var_without_hl_tile Pallas test which already passes, removing triton-specific code checks by @AmesingFlank in #1880
- torch_tpu: point to new repo location by @cota in #1886
- [pallas-tpu] treat sublane constraint like lane constraint by @v0i0 in #1887
- Add epilogue subtiling pass by @choijon5 in #1838
- [Pallas] Enable test_batch_softmax for Pallas backend by @norx1991 in #1889
- fix ref eager subtile tests by @v0i0 in #1899
- use post unification symbols during indexing by @v0i0 in #1896
- Fix _output_indices to recognize views of inputs by @v0i0 in #1868
- Add emit_pipeline loop-carried state for Pallas attention by @v0i0 in #1869
- Add fori_loop loop-carried state and refactor shared helpers by @v0i0 in #1870
- [metal] Add MetalBackend method overrides for codegen by @aditvenk in #1852
- add CodegenDict to better handle common codegen function by @shunting314 in #1902
- [Helion + torch.compile] Add extra_cache_key() and is_cacheable() to _AutotunableKernel protocol by @yf225 in #1903
- Skip overridden config indices during autotuner neighbor generation by @fulvius31 in #1884
- [Pallas] Fix if branches which modify nonlocal variables, unblocking test_grpo_loss_fwd by @AmesingFlank in #1892
- [Pallas] Add HelionPallasPrinter to avoid triton_helpers in codegen by @norx1991 in #1900
- [pallas] Add interpret=True support for CPU reference mode by @v0i0 in #1841
- [Helion + torch.compile] Move extra_cache_key to its own field in LooseAutotuneCacheKey by @yf225 in #1906
- [Helion + torch.compile] Add fusion-aware autotuning tests (temporarily skipped) by @yf225 in #1904
- [Autotuner] Handle CUDA OOM errors gracefully during autotuning by @yf225 in #1912
- [Helion + torch.compile] Switch to fusion-aware autotuning by @yf225 in #1801
- Cap maxnreg to fit register file budget for num_warps by @v0i0 in #1918
- use the user-provided process group for autotuning by @shunting314 in #1823
- disable the flex attention reference in the attention example on tpus by @v0i0 in #1929
- Refactor RNG op handling to be backend-independent by @jansel in #1871
- [cutedsl] Fix tile-origin indexing and atomic/matmul correctness by @jansel in #1873
- Speed up tests by @jansel in #1897
- add max_mismatch_pct argument to run_example by @shunting314 in #1909
- scaled matmul reduce scatter by @shunting314 in #1842
- Use warps_to_threads() instead of hardcoded warp size in maxnreg cap by @v0i0 in #1930
- [Helion + torch.compile] Replace skipTest guards with error assertions for fusion tests by @yf225 in #1932
- [Pallas] Add test for OOB slice when reduction_loops doesn't divide dim by @norx1991 in #1937
- Include extra specialization in specialization_key by @hinriksnaer in #1883
- tpu run script by @v0i0 in #1943
- [Pallas] Use fori_loop if loop bounds includes non-constexpr symbolic values by @AmesingFlank in #1927
New Contributors
- @thcmbs made their first contribution in #1802
- @AmesingFlank made their first contribution in #1880
- @cota made their first contribution in #1886
Full Changelog: v0.3.3...v1.0.0