v1.0.0

Latest

@oulgen oulgen released this 03 Apr 16:57
· 409 commits to main since this release
a6ba80b

What's Changed

  • Add PyTorch Conference Europe 2026 events by @choijon5 in #1812
  • Fix LFBO surrogate training data duplication by @fulvius31 in #1806
  • Move imports to top of file by @jansel in #1804
  • [cutedsl] Generalize shape-chain lowering and fix nested matmul argreduce by @jansel in #1787
  • Fix docs build failure by @jansel in #1814
  • Capping the number of warps to avoid non-performing configs and reducing the autotuning time. by @umechand-amd in #1674
  • Add pytorch-probot config to enable retryBot for Helion by @huydhn in #1808
  • Update torch_tpu pin by @oulgen in #1816
  • avoid nvshmem symm-mem backend by @shunting314 in #1750
  • increase signal pad size for dist matmul kernels by @shunting314 in #1753
  • move distributed runtime utils out of examples/ folder by @shunting314 in #1771
  • Increasing block size dimensions to avoid configs which are slow and poor candidates. by @umechand-amd in #1677
  • Clean up the test skips on ROCm. by @umechand-amd in #1820
  • Fix Pallas buffer donation: only donate in-place mutated tensors by @thcmbs in #1802
  • Gracefully handle the error, "failed to translate module to LLVM IR" by @umechand-amd in #1793
  • fix to prevent incorrect output in test_print by @umechand-amd in #1821
  • HELION_FORCE_AUTOTUNE: skip cache read but write back result by @fulvius31 in #1815
  • Support block_ptr/TensorDescriptor with extra_mask for loads by @hinriksnaer in #1768
  • [autotuner] Add benchmark_batch as unified entry point for autotuner benchmarking by @hinriksnaer in #1810
  • restrict to persistent pid for dist kernels by @shunting314 in #1772
  • remove hl.signal/wait by @shunting314 in #1791
  • cleanup the autotuner factory function by @shunting314 in #1770
  • properly codegen hl.triton_kernel by @shunting314 in #1797
  • move compile time dist APIs to _dist_utils.py by @shunting314 in #1799
  • Use a larger timeout setting to autotune distributed kernels by @shunting314 in #1800
  • [autotuner] removed unused duplicate function check_config_consistancy by @hinriksnaer in #1829
  • chore: Bump actions/deploy-pages from 4 to 5 by @dependabot[bot] in #1826
  • [pallas-tpu] Add tests for previously untested examples by @v0i0 in #1665
  • [cutedsl] Enable more tests by @jansel in #1822
  • Unskipping tests on ROCM by @umechand-amd in #1831
  • [Helion + torch.compile] Fix dynamic shapes in HOP path and enable test_dynamic_shapes_basic by @yf225 in #1836
  • fix regression caused by large block size in dist autotuner by @shunting314 in #1834
  • Revert "Add scheduled workflow to rerun GPU health check failures (#1683)" by @huydhn in #1833
  • rendezvous only once for matmul reduce scatter by @shunting314 in #1824
  • Refactor: use backend *_expr methods in PointerIndexingStrategy by @aditvenk in #1850
  • Fix #1081: Skip failed configs in parallel benchmark by @tianrengao in #1673
  • Fix pre-commit pyrefly on macOS by sharing ignore-missing-imports logic by @aditvenk in #1858
  • Revert "Capping the number of warps to avoid non-performing configs and reducing the autotuning time. (#1674)" by @jansel in #1860
  • [autotuning] Refactor PrecompileFuture construction out of BaseSearch by @hinriksnaer in #1817
  • [cutedsl] Fix launch axis mapping and serialized reductions by @jansel in #1840
  • [Helion + torch.compile] Always use HOP path, gate fusion with torch_compile_fusion kernel setting by @yf225 in #1837
  • [cutedsl] Fix CuTe shape/view lowering and loop-carried matmul accumulation by @jansel in #1856
  • [Pallas] Fix block_spec_info grid/loop_order mismatch by @norx1991 in #1788
  • add script to make tpu wheels by @v0i0 in #1818
  • chore: Bump actions/configure-pages from 5 to 6 by @dependabot[bot] in #1882
  • Fix Pallas BMM block size alignment via new dim_matches method by @thcmbs in #1828
  • Remove FROM_DEFAULT, unify initial population on FROM_BEST_AVAILABLE with pad control by @fulvius31 in #1809
  • [Helion + torch.compile] Reject static_shapes=True with dynamic=True to prevent wrong results by @yf225 in #1864
  • [Helion + torch.compile] Add bench_compile_config protocol for autotuning by @yf225 in #1865
  • Add pallas_attention kernel and default-loop-type test by @v0i0 in #1866
  • Skip Triton-specific batch dim constraint for Pallas matmul by @v0i0 in #1867
  • [autotune] Migrate precompile future to a dedicated file location by @hinriksnaer in #1872
  • [Pallas] Enable test_use_block_size_var_without_hl_tile Pallas test which already passes, removing triton-specific code checks by @AmesingFlank in #1880
  • torch_tpu: point to new repo location by @cota in #1886
  • [pallas-tpu] treat sublane constraint like lane constraint by @v0i0 in #1887
  • Add epilogue subtiling pass by @choijon5 in #1838
  • [Pallas] Enable test_batch_softmax for Pallas backend by @norx1991 in #1889
  • fix ref eager subtile tests by @v0i0 in #1899
  • use post-unification symbols during indexing by @v0i0 in #1896
  • Fix _output_indices to recognize views of inputs by @v0i0 in #1868
  • Add emit_pipeline loop-carried state for Pallas attention by @v0i0 in #1869
  • Add fori_loop loop-carried state and refactor shared helpers by @v0i0 in #1870
  • [metal] Add MetalBackend method overrides for codegen by @aditvenk in #1852
  • add CodegenDict to better handle common codegen function by @shunting314 in #1902
  • [Helion + torch.compile] Add extra_cache_key() and is_cacheable() to _AutotunableKernel protocol by @yf225 in #1903
  • Skip overridden config indices during autotuner neighbor generation by @fulvius31 in #1884
  • [Pallas] Fix if branches which modify nonlocal variables, unblocking test_grpo_loss_fwd by @AmesingFlank in #1892
  • [Pallas] Add HelionPallasPrinter to avoid triton_helpers in codegen by @norx1991 in #1900
  • [pallas] Add interpret=True support for CPU reference mode by @v0i0 in #1841
  • [Helion + torch.compile] Move extra_cache_key to its own field in LooseAutotuneCacheKey by @yf225 in #1906
  • [Helion + torch.compile] Add fusion-aware autotuning tests (temporarily skipped) by @yf225 in #1904
  • [Autotuner] Handle CUDA OOM errors gracefully during autotuning by @yf225 in #1912
  • [Helion + torch.compile] Switch to fusion-aware autotuning by @yf225 in #1801
  • Cap maxnreg to fit register file budget for num_warps by @v0i0 in #1918
  • use the user-provided process group for autotuning by @shunting314 in #1823
  • disable the flex attention reference in the attention example on tpus by @v0i0 in #1929
  • Refactor RNG op handling to be backend-independent by @jansel in #1871
  • [cutedsl] Fix tile-origin indexing and atomic/matmul correctness by @jansel in #1873
  • Speed up tests by @jansel in #1897
  • add max_mismatch_pct argument to run_example by @shunting314 in #1909
  • scaled matmul reduce scatter by @shunting314 in #1842
  • Use warps_to_threads() instead of hardcoded warp size in maxnreg cap by @v0i0 in #1930
  • [Helion + torch.compile] Replace skipTest guards with error assertions for fusion tests by @yf225 in #1932
  • [Pallas] Add test for OOB slice when reduction_loops doesn't divide dim by @norx1991 in #1937
  • Include extra specialization in specialization_key by @hinriksnaer in #1883
  • tpu run script by @v0i0 in #1943
  • [Pallas] Use fori_loop if loop bounds include non-constexpr symbolic values by @AmesingFlank in #1927

New Contributors

Full Changelog: v0.3.3...v1.0.0