Release v1.0.0 · pytorch/helion

What's Changed

Add PyTorch Conference Europe 2026 events by @choijon5 in #1812
Fix LFBO surrogate training data duplication by @fulvius31 in #1806
Move imports to top of file by @jansel in #1804
[cutedsl] Generalize shape-chain lowering and fix nested matmul argreduce by @jansel in #1787
Fix docs build failure by @jansel in #1814
Capping the number of warps to avoid non-performing configs and reducing the autotuning time. by @umechand-amd in #1674
Add pytorch-probot config to enable retryBot for Helion by @huydhn in #1808
Update torch_tpu pin by @oulgen in #1816
avoid nvshmem symm-mem backend by @shunting314 in #1750
increase signal pad size for dist matmul kernels by @shunting314 in #1753
move distributed runtime utils out of examples/ folder by @shunting314 in #1771
Increasing block size dimensions to avoid configs which are slow and poor candidates. by @umechand-amd in #1677
Cleanup the tests skips on ROCM. by @umechand-amd in #1820
Fix Pallas buffer donation: only donate in-place mutated tensors by @thcmbs in #1802
Gracefully handle the error, "failed to translate module to LLVM IR" by @umechand-amd in #1793
fix to prevent incorrect output in test_print by @umechand-amd in #1821
HELION_FORCE_AUTOTUNE: skip cache read but write back result by @fulvius31 in #1815
Support block_ptr/TensorDescriptor with extra_mask for loads by @hinriksnaer in #1768
[autotuner] Add benchmark_batch as unified entry point for autotuner benchmarking by @hinriksnaer in #1810
restrict to persistent pid for dist kernels by @shunting314 in #1772
remove hl.signal/wait by @shunting314 in #1791
cleanup the autotuner factory function by @shunting314 in #1770
properly codegen hl.triton_kernel by @shunting314 in #1797
move compile time dist APIs to _dist_utils.py by @shunting314 in #1799
Use a larger timeout setting to autotune distributed kernels by @shunting314 in #1800
[autotuner] removed unused duplicate function check_config_consistancy by @hinriksnaer in #1829
chore: Bump actions/deploy-pages from 4 to 5 by @dependabot[bot] in #1826
[pallas-tpu] Add tests for previously untested examples by @v0i0 in #1665
[cutedsl] Enable more tests by @jansel in #1822
Unskipping tests on ROCM by @umechand-amd in #1831
[Helion + torch.compile] Fix dynamic shapes in HOP path and enable test_dynamic_shapes_basic by @yf225 in #1836
fix regression caused by large block size in dist autotuner by @shunting314 in #1834
Revert "Add scheduled workflow to rerun GPU health check failures (#1683)" by @huydhn in #1833
rendezvous only once for matmul reduce scatter by @shunting314 in #1824
Refactor: use backend *_expr methods in PointerIndexingStrategy by @aditvenk in #1850
Fix #1081: Skip failed configs in parallel benchmark by @tianrengao in #1673
Fix pre-commit pyrefly on macOS by sharing ignore-missing-imports logic by @aditvenk in #1858
Revert "Capping the number of warps to avoid non-performing configs and reducing the autotuning time. (#1674)" by @jansel in #1860
[autotuning] Refactor PrecompileFuture construction out of BaseSearch by @hinriksnaer in #1817
[cutedsl] Fix launch axis mapping and serialized reductions by @jansel in #1840
[Helion + torch.compile] Always use HOP path, gate fusion with torch_compile_fusion kernel setting by @yf225 in #1837
[cutedsl] Fix CuTe shape/view lowering and loop-carried matmul accumulation by @jansel in #1856
[Pallas] Fix block_spec_info grid/loop_order mismatch by @norx1991 in #1788
add script to make tpu wheels by @v0i0 in #1818
chore: Bump actions/configure-pages from 5 to 6 by @dependabot[bot] in #1882
Fix Pallas BMM block size alignment via new dim_matches method by @thcmbs in #1828
Remove FROM_DEFAULT, unify initial population on FROM_BEST_AVAILABLE with pad control by @fulvius31 in #1809
[Helion + torch.compile] Reject static_shapes=True with dynamic=True to prevent wrong results by @yf225 in #1864
[Helion + torch.compile] Add bench_compile_config protocol for autotuning by @yf225 in #1865
Add pallas_attention kernel and default-loop-type test by @v0i0 in #1866
Skip Triton-specific batch dim constraint for Pallas matmul by @v0i0 in #1867
[autotune] Migrate precompile future to a dedicated file location by @hinriksnaer in #1872
[Pallas] Enable test_use_block_size_var_without_hl_tile Pallas test which already passes, removing triton-specific code checks by @AmesingFlank in #1880
torch_tpu: point to new repo location by @cota in #1886
[pallas-tpu] treat sublane constraint like lane constraint by @v0i0 in #1887
Add epilogue subtiling pass by @choijon5 in #1838
[Pallas] Enable test_batch_softmax for Pallas backend by @norx1991 in #1889
fix ref eager subtile tests by @v0i0 in #1899
use post-unification symbols during indexing by @v0i0 in #1896
Fix _output_indices to recognize views of inputs by @v0i0 in #1868
Add emit_pipeline loop-carried state for Pallas attention by @v0i0 in #1869
Add fori_loop loop-carried state and refactor shared helpers by @v0i0 in #1870
[metal] Add MetalBackend method overrides for codegen by @aditvenk in #1852
add CodegenDict to better handle common codegen function by @shunting314 in #1902
[Helion + torch.compile] Add extra_cache_key() and is_cacheable() to _AutotunableKernel protocol by @yf225 in #1903
Skip overridden config indices during autotuner neighbor generation by @fulvius31 in #1884
[Pallas] Fix if branches which modify nonlocal variables, unblocking test_grpo_loss_fwd by @AmesingFlank in #1892
[Pallas] Add HelionPallasPrinter to avoid triton_helpers in codegen by @norx1991 in #1900
[pallas] Add interpret=True support for CPU reference mode by @v0i0 in #1841
[Helion + torch.compile] Move extra_cache_key to its own field in LooseAutotuneCacheKey by @yf225 in #1906
[Helion + torch.compile] Add fusion-aware autotuning tests (temporarily skipped) by @yf225 in #1904
[Autotuner] Handle CUDA OOM errors gracefully during autotuning by @yf225 in #1912
[Helion + torch.compile] Switch to fusion-aware autotuning by @yf225 in #1801
Cap maxnreg to fit register file budget for num_warps by @v0i0 in #1918
use the user-provided process group for autotuning by @shunting314 in #1823
disable the flex attention reference in the attention example on tpus by @v0i0 in #1929
Refactor RNG op handling to be backend-independent by @jansel in #1871
[cutedsl] Fix tile-origin indexing and atomic/matmul correctness by @jansel in #1873
Speed up tests by @jansel in #1897
add max_mismatch_pct argument to run_example by @shunting314 in #1909
scaled matmul reduce scatter by @shunting314 in #1842
Use warps_to_threads() instead of hardcoded warp size in maxnreg cap by @v0i0 in #1930
[Helion + torch.compile] Replace skipTest guards with error assertions for fusion tests by @yf225 in #1932
[Pallas] Add test for OOB slice when reduction_loops doesn't divide dim by @norx1991 in #1937
Include extra specialization in specialization_key by @hinriksnaer in #1883
tpu run script by @v0i0 in #1943
[Pallas] Use fori_loop if loop bounds includes non-constexpr symbolic values by @AmesingFlank in #1927
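
The HELION_FORCE_AUTOTUNE entry above (#1815) describes a cache policy: skip the cache read, but still write the result back. The following is a minimal Python sketch of that policy, not Helion's actual implementation. `lookup_or_tune`, `tune`, and the dict-based cache are hypothetical names used only for illustration, and treating any value other than an empty string or "0" as enabling the flag is an assumption.

```python
import os


def lookup_or_tune(cache, key, tune, env=None):
    """Return a config for `key`, honouring a force-autotune flag.

    Hypothetical helper: `cache` is a plain dict and `tune` is any
    callable that produces a config. This is not Helion's API.
    """
    env = os.environ if env is None else env
    # Assumption: "" and "0" mean disabled; anything else enables the flag.
    force = env.get("HELION_FORCE_AUTOTUNE", "0") not in ("", "0")
    if not force and key in cache:
        return cache[key]  # normal path: cache hit, no tuning
    config = tune()  # cache miss, or read skipped by the flag
    cache[key] = config  # always write the fresh result back
    return config
```

With the flag set, a stale entry is ignored and then overwritten, so later runs without the flag pick up the freshly tuned result.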

New Contributors

@thcmbs made their first contribution in #1802
@AmesingFlank made their first contribution in #1880
@cota made their first contribution in #1886

Full Changelog: v0.3.3...v1.0.0
