[Autotuner] Add FROM_BEST_AVAILABLE initial population strategy#1365

Merged
jansel merged 104 commits into
pytorch:mainfrom
fulvius31:from-warm-start-test
Mar 3, 2026

Collaborator

@fulvius31 fulvius31 commented Jan 30, 2026

Summary

Adds a FROM_BEST_AVAILABLE initial population strategy that bootstraps autotuning from previously cached best configs. It is intended to address the request for "bootstrapping from a known good config" in #1274.

Target use case: Developers iterating on kernel code who want faster autotuning without sacrificing kernel performance and without falling back to fixed, pre-defined configs.

How it works

  • Differential Evolution: Starts with default config plus up to 20 matching cached configs from prior runs, fills remainder with random configs to reach population size
  • Pattern Search: Uses default config plus cached configs directly as initial population (no random fill)
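The two seeding rules above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `seed_population`, `Config`, and the `random_config` callback are hypothetical names, and skipping duplicate configs is an assumption rather than something the description states.

```python
import random
from typing import Callable, Optional

Config = dict  # hypothetical stand-in for an autotuner config

MAX_SEED_CONFIGS = 20  # mirrors the HELION_BEST_AVAILABLE_MAX_CONFIGS default


def seed_population(
    default: Config,
    cached: list,
    population_size: Optional[int],
    random_config: Callable[[], Config],
) -> list:
    """Default config first, then up to MAX_SEED_CONFIGS distinct cached configs.

    With a population_size (Differential Evolution style), pad with random
    configs until the population is full. With None (Pattern Search style),
    return the seeds as they are, with no random fill.
    """
    population = [default]
    for cfg in cached:
        if len(population) - 1 >= MAX_SEED_CONFIGS:
            break
        if cfg not in population:  # assumption: duplicates are skipped
            population.append(cfg)
    if population_size is not None:
        while len(population) < population_size:
            population.append(random_config())
    return population
```

Called with `population_size=None`, the function returns only the default plus the cached seeds, which is the Pattern Search behaviour described above.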

Cache matching uses hardware name + normalized specialization key (tensor dtype, device, shape, strides), filtering out code object references so configs transfer across kernel edits.
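As a rough sketch of why dropping code object references lets configs survive kernel edits, the key can be built from the hardware name and tensor metadata only. The function name, the tuple layout, and the `("code", ...)` tagging convention below are all assumptions made for illustration; the actual key format in helion/autotuner/local_cache.py may look quite different.

```python
import hashlib
import json


def normalized_cache_key(hardware: str, specialization: tuple) -> str:
    """Hash hardware name plus tensor metadata, dropping code-object entries."""
    kept = [
        entry
        for entry in specialization
        # Hypothetical convention: code references are tagged ("code", ...)
        if not (isinstance(entry, tuple) and entry and entry[0] == "code")
    ]
    # Tuples serialize as JSON lists, so the payload is stable across runs.
    payload = json.dumps([hardware, kept], default=str)
    return hashlib.sha256(payload.encode()).hexdigest()


# Same dtype, device, shape, and strides; only the code reference differs.
before_edit = (("tensor", "float16", "cuda", (1024, 1024), (1024, 1)), ("code", "v1"))
after_edit = (("tensor", "float16", "cuda", (1024, 1024), (1024, 1)), ("code", "v2"))
same_key = normalized_cache_key("RTX 5090", before_edit) == normalized_cache_key(
    "RTX 5090", after_edit
)
```

Here `same_key` is true because the only difference between the two specializations is the filtered-out code entry.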

Benchmark results using 1 cached best_config and default PatternSearch

The kernels used are the ones from ~/examples.

Hardware: NVIDIA RTX 5090
torch Version: 2.10.0+cu130
helion Version: 0.2.11.dev7+ga7e94e60c
triton Version: 3.6.0+git9844da95

MatMul Benchmark

| Strategy | Autotune Time | Codegen Calls |
| --- | --- | --- |
| Full Random | 1238s | 5462 |
| FROM_DEFAULT | 65s | 671 |
| FROM_BEST_AVAILABLE (FROM_DEFAULT cache) | 95s | 1002 |
| FROM_BEST_AVAILABLE (FULL cache) | 92s | 855 |

| Implementation | Full Random | FROM_DEFAULT | BEST_AVAIL (default) | BEST_AVAIL (full) |
| --- | --- | --- | --- | --- |
| helion (1) | 0.0183ms (0.95x) | 0.024ms (0.73x) | 0.0186ms (0.91x) | 0.0183ms (0.94x) |
| helion (2) | 0.019ms (1.13x) | 0.0246ms (0.88x) | 0.0204ms (1.06x) | 0.019ms (1.13x) |
| helion (3) | 0.0184ms (1.31x) | 0.0225ms (1.09x) | 0.0225ms (1.11x) | 0.0184ms (1.35x) |
| helion (4) | 0.0184ms (1.27x) | 0.021ms (1.24x) | 0.0214ms (1.21x) | 0.0184ms (1.27x) |
| helion_matmul_autograd | 0.0184ms (0.95x) | 0.0232ms (0.75x) | 0.0187ms (0.90x) | 0.0183ms (0.94x) |
| helion_addmm_autograd | 0.0189ms (1.11x) | 0.0238ms (0.96x) | 0.0211ms (0.99x) | 0.0189ms (1.11x) |
| helion_addmm_autograd_scaled | 0.019ms (1.10x) | 0.024ms (0.96x) | 0.0212ms (0.99x) | 0.019ms (1.10x) |

Result: FROM_BEST_AVAILABLE with full cache matches Full Random kernel times across all implementations at 13x less tuning cost. FROM_DEFAULT is 19x faster but produces 14-31% slower kernels.
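The 13x, 19x, and 14-31% figures follow directly from the MatMul numbers reported above. A short check of the arithmetic, using only values copied from the tables (the variable names are just for illustration):

```python
# Autotune times in seconds, from the MatMul strategy table.
autotune_seconds = {
    "Full Random": 1238,
    "FROM_DEFAULT": 65,
    "FROM_BEST_AVAILABLE (FULL cache)": 92,
}
baseline = autotune_seconds["Full Random"]
tuning_speedup = {name: baseline / secs for name, secs in autotune_seconds.items()}

# (Full Random ms, FROM_DEFAULT ms) per MatMul implementation.
kernel_ms = {
    "helion (1)": (0.0183, 0.024),
    "helion (2)": (0.019, 0.0246),
    "helion (3)": (0.0184, 0.0225),
    "helion (4)": (0.0184, 0.021),
    "helion_matmul_autograd": (0.0184, 0.0232),
    "helion_addmm_autograd": (0.0189, 0.0238),
    "helion_addmm_autograd_scaled": (0.019, 0.024),
}
# Percentage by which the FROM_DEFAULT kernel is slower than the Full Random one.
slowdown_pct = {
    name: (from_default / full_random - 1) * 100
    for name, (full_random, from_default) in kernel_ms.items()
}

for name, factor in tuning_speedup.items():
    print(f"{name}: {factor:.1f}x faster to tune than Full Random")
print(f"FROM_DEFAULT kernels: {min(slowdown_pct.values()):.0f}% "
      f"to {max(slowdown_pct.values()):.0f}% slower")
```

The smallest slowdown comes from helion (4) and the largest from helion (1).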

Softmax Benchmark

| Strategy | Autotune Time | Codegen Calls |
| --- | --- | --- |
| Full Random | 808s | 2627 |
| FROM_DEFAULT | 32s | 220 |
| FROM_BEST_AVAILABLE (FROM_DEFAULT cache) | 42s | 305 |
| FROM_BEST_AVAILABLE (FULL cache) | 53s | 423 |

| Implementation | Full Random | FROM_DEFAULT | BEST_AVAIL (default) | BEST_AVAIL (full) |
| --- | --- | --- | --- | --- |
| Helion Simple | 0.02ms (2.34x) | 0.0164ms (2.50x) | 0.0169ms (2.40x) | 0.0191ms (2.33x) |
| Helion Two Pass | 0.0199ms (2.35x) | 0.0225ms (1.83x) | 0.0168ms (2.43x) | 0.0169ms (2.63x) |
| Helion (Aggregate) | 0.0231ms (2.04x) | 0.0225ms (1.86x) | 0.019ms (2.16x) | 0.0231ms (2.07x) |

Result: Mixed outcome. FROM_DEFAULT wins for Helion Simple (best kernel at lowest cost), but FROM_BEST_AVAILABLE (default cache) wins for Helion Two Pass.
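To make the mixed Softmax outcome concrete, the fastest strategy per implementation can be read off the reported kernel times. The sketch below ranks by kernel time only and ignores tuning cost; `fastest_strategy` is a name made up for this example.

```python
# Kernel times in ms, copied from the Softmax results above.
STRATEGIES = ("Full Random", "FROM_DEFAULT", "BEST_AVAIL (default)", "BEST_AVAIL (full)")
softmax_ms = {
    "Helion Simple": (0.02, 0.0164, 0.0169, 0.0191),
    "Helion Two Pass": (0.0199, 0.0225, 0.0168, 0.0169),
    "Helion (Aggregate)": (0.0231, 0.0225, 0.019, 0.0231),
}


def fastest_strategy(times):
    """Return the strategy whose kernel time is lowest."""
    best_index = min(range(len(times)), key=times.__getitem__)
    return STRATEGIES[best_index]


winners = {impl: fastest_strategy(times) for impl, times in softmax_ms.items()}
for impl, strategy in winners.items():
    print(f"{impl}: {strategy}")
```

No single strategy is fastest for all three Softmax implementations, which is the workload dependence the result describes.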

Key takeaways

  • FROM_BEST_AVAILABLE with a good cache matches Full Random quality at ~13x less tuning cost (MatMul: all 7 implementations match or beat Full Random)
  • FROM_DEFAULT is fastest but can miss optimal configs when default is far from best (MatMul: 14-31% slower kernels)
  • Cache quality matters: Full effort cache outperforms default cache in MatMul; results vary for Softmax
  • Workload-dependent: When default is already near-optimal (Softmax Simple), cache-seeding adds overhead without benefit

When to use

  • FROM_BEST_AVAILABLE: When iterating on kernel code and you have prior tuning runs for similar shapes/hardware
  • FROM_DEFAULT: Quick iteration when no relevant cache exists or the default is known to be good. Note that if effort is set to quick, FROM_BEST_AVAILABLE will use FROM_DEFAULT as a config anyway
  • Full Random: Offline profiling when compile time is not a concern

Usage

HELION_AUTOTUNE_EFFORT=quick HELION_AUTOTUNER_INITIAL_POPULATION=from_best_available python example/{matmul.py,softmax.py}

Configuration

| Env var | Default | Description |
| --- | --- | --- |
| HELION_BEST_AVAILABLE_MAX_CONFIGS | 20 | Max cached configs to seed |
| HELION_BEST_AVAILABLE_MAX_CACHE_SCAN | 500 | Max cache files to scan |
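For illustration, the two limits above could be read with their documented defaults along these lines. The helper `read_int_env` is hypothetical, and rejecting negative values is an assumption; Helion's own settings code may parse and validate these variables differently.

```python
import os


def read_int_env(name: str, default: int) -> int:
    """Read a non-negative integer from the environment, or return the default."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError(f"{name} must be non-negative, got {value}")
    return value


# Defaults taken from the configuration table above.
max_configs = read_int_env("HELION_BEST_AVAILABLE_MAX_CONFIGS", 20)
max_cache_scan = read_int_env("HELION_BEST_AVAILABLE_MAX_CACHE_SCAN", 500)
```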

@fulvius31 fulvius31 marked this pull request as draft January 30, 2026 17:39
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 30, 2026
@fulvius31 fulvius31 force-pushed the from-warm-start-test branch 4 times, most recently from 86dc446 to b4d04a7 Compare February 3, 2026 19:51
@fulvius31 fulvius31 changed the title [WIP][Autotuner] Add FROM_BEST_AVAILABLE initial population strategy [Autotuner] Add FROM_BEST_AVAILABLE initial population strategy Feb 3, 2026
@fulvius31 fulvius31 marked this pull request as ready for review February 3, 2026 20:21
@fulvius31
Collaborator Author

I think it's pretty ready. Could you take a look when you have a moment? @jansel @v0i0 @oulgen

@fulvius31 fulvius31 force-pushed the from-warm-start-test branch 5 times, most recently from 7df1c0a to ff8d8c8 Compare February 6, 2026 16:06
@fulvius31 fulvius31 force-pushed the from-warm-start-test branch from 2c448da to c10607d Compare February 8, 2026 13:27
@fulvius31
Collaborator Author

@jansel I don't think the failed test is related to this PR

Contributor

@jansel jansel left a comment

@fulvius31 can you rebase and resolve the merge conflict? That might also fix the test.

@fulvius31
Collaborator Author

Test failures related?

@jansel I don't think so. I think they have been failing since #1542

@fulvius31 fulvius31 requested a review from jansel February 26, 2026 17:37
Comment thread helion/autotuner/config_spec.py Outdated
Comment thread helion/autotuner/local_cache.py
@fulvius31 fulvius31 requested a review from jansel February 28, 2026 22:41
@fulvius31
Collaborator Author

@jansel I don't think the test failures were related to this PR.

Comment thread helion/_compat.py Outdated
@fulvius31 fulvius31 requested a review from jansel March 1, 2026 23:24
@jansel
Contributor

jansel commented Mar 3, 2026

@fulvius31 can you rebase and fix merge conflicts?

@fulvius31 fulvius31 force-pushed the from-warm-start-test branch from 653df7f to 549bfef Compare March 3, 2026 13:09
@fulvius31
Collaborator Author

@fulvius31 can you rebase and fix merge conflicts?

@jansel done

@jansel jansel merged commit 5b02214 into pytorch:main Mar 3, 2026
17 of 19 checks passed
nullplay pushed a commit to nullplay/helion that referenced this pull request Mar 17, 2026
umechand-amd pushed a commit to umechand-amd/helion that referenced this pull request Mar 23, 2026