[autotune] Feat: Overlapping GPU benchmarking and CPU compilation #1416
hinriksnaer wants to merge 25 commits into
|
I love the idea. Would be curious to see whether this overlapping causes any consistency/measurement problems. |
|
My main concern with this would be how stable the perf measurements are. We have had a lot of issues with unstable benchmark results leading to the autotuner making incorrect decisions. Could we benchmark the stddev/min/max/mean/median of the raw perf measurements of a single kernel with and without this change? Right now the autotuner takes the median. We could also possibly offset more noise by running additional trials. |
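A minimal sketch of the comparison requested above, assuming the raw per-trial timings of one kernel are available as plain lists of milliseconds. The function name `summarize` and the sample values are hypothetical and are not taken from this PR or from Helion.

```python
import statistics


def summarize(samples_ms):
    """Return stddev/min/max/mean/median for a list of raw timings (ms)."""
    return {
        "n": len(samples_ms),
        "min": min(samples_ms),
        "max": max(samples_ms),
        "mean": statistics.fmean(samples_ms),
        "median": statistics.median(samples_ms),
        # Sample standard deviation needs at least two data points.
        "stddev": statistics.stdev(samples_ms) if len(samples_ms) > 1 else 0.0,
    }


# Hypothetical timings for illustration only, not real measurements.
baseline = [1.00, 1.02, 0.99, 1.01, 1.00]
overlapped = [1.00, 1.10, 0.98, 1.05, 1.01]

for name, samples in (("baseline", baseline), ("overlapped", overlapped)):
    stats = summarize(samples)
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

Reporting the spread alongside the median makes it visible when overlapping widens the distribution even though the median barely moves.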
|
@jansel Currently working on some more standardized benchmarking for autotune to make sure we can better quantify meaningful metrics that address these kinds of concerns. I can tag you in a draft PR tomorrow once it's ready. That way we can get a discussion going to scope out some valuable metrics that could help make these contributions a clearer yes or no. |
|
@hinriksnaer @jansel I think right now a lot of the CI autotuning jobs are not working well, so I'll do my own benchmarking. My take is that if we see the autotuner get identical results faster, we should enable this -- although we prob want to check for smaller shapes and on alternative hardware. I'll also do the data analysis around whether this introduces bias into perf measurements. I have some code set up to extract all perf quantiles. I'll take a look and share the results. |
|
Quick update on this: I ran the B200 benchmarks and found that overlapping improves wall-clock time: the geomean wall-clock time reduction compared to pattern search goes from 0.43x to 0.48x. @hinriksnaer you can run the CI job with your new logging to confirm. The good news is that I don't see any discernible loss in performance due to overlapping. I'll work on the data analysis to see at the config level whether this introduces variance into measurement.
|
|
@hinriksnaer @jansel Just as in the adaptive compile time PR #1384, I wonder if there is a simple check we can do in the initial population benchmarking to see whether compile time overlapping will introduce bias (as this could also vary depending on the user's CPU). Could we introduce a check that compares the benchmark vs the rebenchmark results, and if the difference is within some tolerance we keep overlapping (and disable it if not)? |
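One way such a check could look, as a sketch only: compare each config's first timing (taken while compilation overlaps) with its re-benchmarked timing, and keep overlapping only if the median relative drift stays within a tolerance. The function name `overlap_is_safe`, the 5% default, and the use of the median are assumptions for illustration, not what the PR implements.

```python
import statistics


def overlap_is_safe(first_ms, rebench_ms, rel_tol=0.05):
    """Return True if overlapped timings agree with re-benchmarked timings.

    first_ms:   per-config timings measured while compilation was overlapping
    rebench_ms: timings of the same configs, re-measured afterwards
    rel_tol:    hypothetical tolerance on the median relative drift
    """
    if not first_ms or len(first_ms) != len(rebench_ms):
        raise ValueError("need two non-empty timing lists of equal length")
    # Relative drift of each config, using the re-benchmark as the reference.
    drifts = [abs(a - b) / b for a, b in zip(first_ms, rebench_ms)]
    return statistics.median(drifts) <= rel_tol


# Hypothetical timings for illustration only.
print(overlap_is_safe([1.00, 2.02, 0.51], [1.01, 2.00, 0.50]))
print(overlap_is_safe([1.30, 2.60, 0.70], [1.00, 2.00, 0.50]))
```

Using the median drift rather than the maximum keeps a single noisy config from disabling overlapping for the whole run; a stricter variant could check a high quantile instead.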
|
Made some new changes that hopefully address your concerns, @jansel. There is now a re-benchmarking step. Re-benchmarking everything might introduce too much of a bottleneck, so we might want to introduce some heuristic that samples a subset of the configurations instead. |
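A rough sketch of the sampling heuristic mentioned above, assuming a uniform random subset is acceptable. The function name `sample_for_rebenchmark` and the `fraction`, `min_count`, and `seed` parameters are hypothetical.

```python
import random


def sample_for_rebenchmark(configs, fraction=0.1, min_count=3, seed=0):
    """Pick a random subset of configs to re-benchmark.

    Takes roughly `fraction` of the configs, but never fewer than
    `min_count` and never more than the number of configs available.
    """
    configs = list(configs)
    if not configs:
        return []
    k = min(len(configs), max(min_count, round(len(configs) * fraction)))
    # A seeded generator keeps the selection reproducible across runs.
    return random.Random(seed).sample(configs, k)


subset = sample_for_rebenchmark(range(200))
print(len(subset))
```

A uniform sample is the simplest choice; biasing the sample toward the fastest configs is another option, since those are the ones the autotuner is most likely to select.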
|
@jansel last test failure was autograd test timeout, don't think it is related to these changes. |
|
I'm looking for more data here on how this affects measurement noise. Not end-to-end autotuning perf, but individual measurement noise. |
|
Converting this to draft for now. Hoping to land some updates to the search abstraction that streamline the process of swapping out and testing different compilation/benchmarking strategies. I'll revisit this once apples-to-apples comparisons and noise measurement are more natively supported. |

related issue #1400
The problem
Autotuning currently wastes a lot of time with idle resources. Here's what happens when we benchmark 200 configs:
The GPU sits idle while we compile, then the CPU sits idle while we benchmark. That's a lot of wasted compute.
The solution
Increase hardware utilization by compiling configurations in parallel while the main thread sequentially benchmarks the compiled kernels. This approach does the following iteratively:
This is a simple approach that overlaps GPU and CPU utilization, which reduces unused resources during autotuning. More "clever" scheduling can be added later on, such as a dedicated CPU scheduling thread, in order to unblock CPU scheduling from GPU benchmarking.
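The overlap described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `compile_config` and `benchmark` are hypothetical stand-ins for the real compile and GPU timing steps, and a thread pool is used only to keep the example self-contained (a real implementation may need processes for CPU-bound compilation).

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def compile_config(config):
    """Hypothetical stand-in for CPU compilation of one config."""
    time.sleep(0.01)  # pretend compilation takes a while
    return lambda: config * 2  # the "compiled kernel"


def benchmark(kernel):
    """Hypothetical stand-in for GPU benchmarking; returns elapsed seconds."""
    start = time.perf_counter()
    kernel()
    return time.perf_counter() - start


def autotune_overlapped(configs, max_workers=4):
    """Compile in background workers, benchmark sequentially on this thread."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(compile_config, c): c for c in configs}
        # Benchmark each kernel as soon as its compilation finishes, while
        # the remaining configs keep compiling in the background.
        for future in as_completed(futures):
            results[futures[future]] = benchmark(future.result())
    return results


timings = autotune_overlapped(range(8))
print(sorted(timings))
```

Benchmarking stays on a single thread so that only one kernel is timed at a time; whether the background compilation still perturbs those timings is the open measurement question raised in the conversation.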
Usage
Enable with environment variable:
export HELION_AUTOTUNE_OVERLAP_COMPILATION=1
Early Benchmark
Baseline
With Changes
Next steps
I think we should benchmark this change first. From there we can add tests and decide whether to ship it as is or add further layers of complexity for additional performance gain.