[Language] Adds a random number generation capability #1196
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit will avoid duplicating their findings in the review comments, which may lead to a less comprehensive review.

Walkthrough

The PR introduces Philox-based random number generation support to TileLang, spanning the Python API layer, C++ intrinsic registration, CUDA codegen integration, and device-side implementation, alongside comprehensive 1D and 2D test cases validating cross-implementation consistency.
Sequence Diagram

```mermaid
sequenceDiagram
    participant PyUser as Python User
    participant PyAPI as tilelang.language.rand()
    participant Macro as _rand_parallel_impl()
    participant TVMCall as T.call_intrin(philox_rand)
    participant Codegen as CUDA Codegen
    participant DevKernel as Device Kernel (philox_rand)
    PyUser->>PyAPI: rand(buffer, seed, n_rounds)
    activate PyAPI
    PyAPI->>PyAPI: Validate shape (1D or 2D)
    PyAPI->>PyAPI: Compute total_elems, block_m, block_n
    PyAPI->>Macro: _rand_parallel_impl(buffer, seed, ...)
    deactivate PyAPI
    activate Macro
    Macro->>TVMCall: T.call_intrin(philox_rand, ...)
    deactivate Macro
    TVMCall->>Codegen: Emit CUDA kernel call
    activate Codegen
    Codegen->>Codegen: Set need_random_h_ flag
    Codegen->>Codegen: Emit extern call to tl::philox_rand
    Codegen->>Codegen: Include random.h in Finish
    deactivate Codegen
    Codegen->>DevKernel: Execute on CUDA device
    activate DevKernel
    DevKernel->>DevKernel: Initialize Philox state from seed
    DevKernel->>DevKernel: Run philox_impl_device (n_rounds)
    DevKernel->>DevKernel: Convert uint32 → uniform float
    DevKernel->>DevKernel: Write to output buffer
    deactivate DevKernel
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📜 Recent review details

Configuration used: CodeRabbit UI. Review profile: CHILL. Plan: Pro.

📒 Files selected for processing (3)

🧰 Additional context used
🧬 Code graph analysis (1): testing/python/language/test_tilelang_language_rand.py (6)

🪛 Ruff (0.14.3): testing/python/language/test_tilelang_language_rand.py
- 27-27: Unused function argument (ARG001)
- 50-50: Unused function argument (ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the pre-submission checks. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Actionable comments posted: 2
🧹 Nitpick comments (3)
testing/python/language/test_rand.py (1)

9-21: Simplify indexing since only one block is used. Line 19 uses bx * M + i, but since the kernel is launched with only 1 block (line 14), bx is always 0. The multiplication is unnecessary and potentially misleading.

Apply this diff to simplify the indexing:

```diff
 for i in T.Parallel(M):
-    A[bx * M + i] = rand_buffer[i]
+    A[i] = rand_buffer[i]
```

tilelang/language/random.py (2)
39-47: Consider simplifying the uint32-to-float conversion. The current implementation reinterprets as signed int32 and handles negative values, which works but is more complex than necessary. A simpler approach is to convert the uint32 directly to float and scale by 1/2^32.

Alternative implementation:

```diff
 @T.macro
 def uint32_to_uniform_float(x: tir.PrimExpr) -> tir.PrimExpr:
-    assert x.dtype == 'uint32' or x.dtype == "int32", f"x.dtype {x.dtype} is not supported"
-    x_int32 = T.reinterpret('int32', x)
-    scale = tir.const(4.6566127342e-10, "float32")
-    x_abs = T.if_then_else(x_int32 < 0, -x_int32 - 1, x_int32)
-    return T.Cast("float32", x_abs) * scale
+    assert x.dtype == 'uint32', f"x.dtype {x.dtype} is not supported"
+    # Scale factor: 1 / 2^32 = 2.3283064365e-10
+    scale = tir.const(2.3283064365e-10, "float32")
+    return T.Cast("float32", x) * scale
```

This avoids the signed interpretation and conditional logic. If the current implementation has specific numerical properties you want to preserve, please document why.
50-62: Efficiency concern: only 1 of 4 Philox outputs is used. The Philox algorithm produces four independent random values (c0, c1, c2, c3) per invocation, but only c0 is used. This wastes 75% of the generated entropy and is computationally inefficient.

Consider optimizing by processing 4 elements per Philox call:

```python
@T.macro
def _rand_parallel_impl(buffer: T.Buffer, seed_lo, seed_hi, total_elems, n_rounds):
    # Process 4 elements at a time
    for i in T.Parallel((total_elems + 3) // 4):
        offset = T.Cast("uint32", i)
        offset_lo = offset
        offset_hi = tir.const(0, "uint32")
        c0, c1, c2, c3 = philox_impl(offset_lo, offset_hi,
                                     tir.const(0, "uint32"), tir.const(0, "uint32"),
                                     seed_lo, seed_hi, n_rounds)
        # Store up to 4 values
        for j in T.serial(4):
            idx = i * 4 + j
            if idx < total_elems:
                coords = index_to_coordinates(idx, buffer.shape)
                rand_val = [c0, c1, c2, c3][j]
                buffer[coords] = uint32_to_uniform_float(rand_val)
```

This would reduce the number of Philox calls by 4×, significantly improving performance for large buffers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- testing/python/language/test_rand.py (1 hunks)
- tilelang/language/__init__.py (1 hunks)
- tilelang/language/random.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tilelang/language/__init__.py (1)
- tilelang/language/random.py (1): rand (65-74)

testing/python/language/test_rand.py (5)
- tilelang/jit/kernel.py (1): out_idx (461-462)
- tilelang/language/proxy.py (1): Tensor (252-253)
- tilelang/language/allocate.py (1): alloc_fragment (59-70)
- tilelang/language/random.py (1): rand (65-74)
- tilelang/language/parallel.py (1): Parallel (9-29)

tilelang/language/random.py (3)
- tilelang/language/utils.py (1): index_to_coordinates (91-110)
- tilelang/language/tir/op.py (2): reinterpret (2006-2025), if_then_else (3014-3044)
- tilelang/language/parallel.py (1): Parallel (9-29)
🪛 Ruff (0.14.3)
tilelang/language/__init__.py
95-95: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
testing/python/language/test_rand.py
25-25: Unused function argument: dtype
(ARG001)
tilelang/language/random.py
58-58: Unpacked variable c1 is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
58-58: Unpacked variable c2 is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
58-58: Unpacked variable c3 is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🔇 Additional comments (5)
tilelang/language/__init__.py (1)

95-96: LGTM! Clean re-export of the new rand API. The import follows the established pattern in this file for exposing module-level APIs. The noqa: F401 directive is appropriate for re-exported symbols.

testing/python/language/test_rand.py (1)

45-47: LGTM! The test function is properly guarded with @tilelang.testing.requires_cuda and correctly invokes the test harness.

tilelang/language/random.py (3)

6-8: LGTM! Correct implementation of high-bits multiplication. The function correctly computes the high 32 bits of a 64-bit product using widening multiplication and bit shifting.

11-36: LGTM! Correct Philox-4x32 implementation. The implementation follows the standard Philox-4x32 algorithm with correct constants and round functions. The key evolution and counter mixing are properly implemented.

65-74: LGTM! Clean public API implementation. The function correctly:
- Converts the seed to uint64 and splits it into 32-bit components
- Calculates the total number of elements from the buffer shape
- Delegates to the parallel implementation
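For readers who want to sanity-check the Philox-4x32 structure reviewed above, here is a compact pure-Python reference sketch. It uses the standard Philox round multipliers and Weyl key increments from the Random123 description; the names `philox_4x32` and `mulhilo` are illustrative, not identifiers from the TileLang code.

```python
# Pure-Python reference for Philox-4x32; a sketch for cross-checking behavior,
# not the TileLang implementation itself.
MASK32 = 0xFFFFFFFF
PHILOX_M0, PHILOX_M1 = 0xD2511F53, 0xCD9E8D57   # round multipliers
PHILOX_W0, PHILOX_W1 = 0x9E3779B9, 0xBB67AE85   # per-round key increments

def mulhilo(a, b):
    """Return the high and low 32 bits of the 64-bit product a * b."""
    p = a * b
    return (p >> 32) & MASK32, p & MASK32

def philox_4x32(counter, key, n_rounds=10):
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for _ in range(n_rounds):
        hi0, lo0 = mulhilo(PHILOX_M0, c0)
        hi1, lo1 = mulhilo(PHILOX_M1, c2)
        # Mix: new counter words combine the multiply halves, old words, and keys
        c0, c1, c2, c3 = (hi1 ^ c1 ^ k0), lo1, (hi0 ^ c3 ^ k1), lo0
        k0 = (k0 + PHILOX_W0) & MASK32  # key schedule: bump both keys each round
        k1 = (k1 + PHILOX_W1) & MASK32
    return c0, c1, c2, c3

# Deterministic: same counter and key always produce the same 4 outputs.
assert philox_4x32((0, 0, 0, 0), (42, 0)) == philox_4x32((0, 0, 0, 0), (42, 0))
# With a fixed key the map is a bijection on counters, so distinct counters
# necessarily yield distinct outputs.
assert philox_4x32((0, 0, 0, 0), (42, 0)) != philox_4x32((1, 0, 0, 0), (42, 0))
```

A reference like this is handy for validating a device-side implementation on a few counter/key pairs before running large statistical tests.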
```python
@triton.jit
def triton_rand_1d(X, M, seed, dtype: tl.constexpr):
    pid = tl.program_id(0)
    offset = pid * M + tl.arange(0, M)
    rand = tl.rand(seed, offset)
    tl.store(X + offset, rand, mask=offset < M)
```
Remove unused parameter and fix mask logic.

Two issues:
- The dtype parameter is declared but never used in the function body.
- The mask offset < M is incorrect. Since offset = pid * M + tl.arange(0, M), the values range from pid * M to pid * M + M - 1, which are always >= M when pid > 0. The mask should check against a total size or be removed if the grid guarantees coverage.

Apply this diff:

```diff
 @triton.jit
-def triton_rand_1d(X, M, seed, dtype: tl.constexpr):
+def triton_rand_1d(X, M, seed, TOTAL_SIZE):
     pid = tl.program_id(0)
     offset = pid * M + tl.arange(0, M)
     rand = tl.rand(seed, offset)
-    tl.store(X + offset, rand, mask=offset < M)
+    tl.store(X + offset, rand, mask=offset < TOTAL_SIZE)
```

Note: If the grid is always guaranteed to match the total size exactly, the mask can be omitted entirely.
🧰 Tools
🪛 Ruff (0.14.3)
25-25: Unused function argument: dtype
(ARG001)
Actionable comments posted: 3
♻️ Duplicate comments (2)
testing/python/language/test_tilelang_language_rand.py (2)
24-29: Remove unused parameter and fix mask logic.The issues flagged in the previous review remain:
- The
dtypeparameter is declared but never used- The mask
offset < Mis incorrect whenpid > 0since offset ranges frompid * Mtopid * M + M - 1
32-42: Fix Triton kernel invocation and remove unnecessary tensor allocation.The issues flagged in the previous review remain:
- Line 34 creates
tilelang_resultunnecessarily before reassignment on line 35- Line 39 uses
tl.constexpr(M)incorrectly—constexpris a type hint, not a runtime wrapper- Line 40 passes unused
dtypeparameter to the kernel
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- testing/python/language/test_tilelang_language_rand.py (1 hunks)
- tilelang/language/random.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tilelang/language/random.py (2)
- tilelang/language/__init__.py (1): import_source (98-100)
- tilelang/language/tir/op.py (2): call_extern (173-195), address_of (464-480)

testing/python/language/test_tilelang_language_rand.py (3)
- tilelang/language/allocate.py (1): alloc_fragment (59-70)
- tilelang/language/random.py (1): rand (92-97)
- tilelang/language/parallel.py (1): Parallel (9-29)
🪛 Ruff (0.14.3)
testing/python/language/test_tilelang_language_rand.py
25-25: Unused function argument: dtype
(ARG001)
🔇 Additional comments (6)
testing/python/language/test_tilelang_language_rand.py (3)

1-6: LGTM! The imports are appropriate for testing TileLang and Triton random number generation.

45-51: LGTM! The test structure is appropriate. The @tilelang.testing.requires_cuda decorator correctly guards CUDA-dependent tests.

14-15: Review comment appears speculative without definitive evidence of a problem. The original comment raises reasonable architectural concerns about single-block kernels and fragment buffer sizes, but the search results do not reveal explicit size constraints in the tilelang codebase. The code under review is a test file with intentional simplifications; allocating a 1D fragment of size M for a random-number fill is not unusual for test scenarios. The concern about "large M values" is not substantiated by error handling or documented limits found in the codebase.

Without concrete evidence that fragment buffers have enforced size limits (the allocate.py documentation in the search results does not specify such limits), and given this is a test rather than production code, the review comment represents a best-practice suggestion rather than a critical issue.
tilelang/language/random.py (3)

8-12: LGTM! The inline PTX assembly for computing the high 32 bits of unsigned multiplication is correct and efficient.

14-44: LGTM! The Philox algorithm implementation is correct. The constants and key schedule match the standard Philox specification.

92-97: LGTM! The public API correctly computes the total number of elements and delegates to the implementation macro.
tilelang/language/random.py (outdated)

```cpp
__device__ void philox_rand_kernel(
    float* output,
    int total_elems,
    uint64_t seed,
    int n_rounds
) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= total_elems) return;
  uint32_t seed_lo = (uint32_t)(seed & 0xFFFFFFFFULL);
  uint32_t seed_hi = (uint32_t)((seed >> 32) & 0xFFFFFFFFULL);
  uint32_t offset_lo = (uint32_t)idx;
  uint32_t offset_hi = 0U;
  uint32_t c0 = offset_lo;
  uint32_t c1 = offset_hi;
  uint32_t c2 = 0U;
  uint32_t c3 = 0U;
  philox_impl_device(&c0, &c1, &c2, &c3, seed_lo, seed_hi, n_rounds);
  output[idx] = uint32_to_uniform_float_device(c0);
}
```
Consider efficiency improvement and note dtype limitation.

Two observations:
- Efficiency: The Philox algorithm generates 4 random uint32 values (c0, c1, c2, c3) per invocation, but only c0 is used (line 74). For better throughput, consider generating 4 output values per thread.
- Hardcoded dtype: The kernel signature uses float* output (line 57), which limits the implementation to float32. This conflicts with the dtype parameter passed to _rand_parallel_impl (line 80), which suggests multi-dtype support.

For the efficiency improvement, you could modify the kernel to write 4 consecutive values:

```diff
- output[idx] = uint32_to_uniform_float_device(c0);
+ int base_idx = idx * 4;
+ if (base_idx < total_elems) output[base_idx] = uint32_to_uniform_float_device(c0);
+ if (base_idx + 1 < total_elems) output[base_idx + 1] = uint32_to_uniform_float_device(c1);
+ if (base_idx + 2 < total_elems) output[base_idx + 2] = uint32_to_uniform_float_device(c2);
+ if (base_idx + 3 < total_elems) output[base_idx + 3] = uint32_to_uniform_float_device(c3);
```

Note: This would require adjusting the grid/block dimensions accordingly.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tilelang/language/random.py (1)

6-6: Drop or wire through the unused dtype parameter. dtype never leaves _rand_parallel_impl, so it's confusing noise and will trigger lint. Either remove it or pass it through to the intrinsic once multi-dtype support lands.

src/tl_templates/cuda/random.h (1)

33-36: Simplify: remove unnecessary uint64_t casts. The uint64_t casts in these operations are unnecessary:
- Lines 33-34: Casting to uint64_t then truncating to uint32_t for the low 32 bits can be replaced with direct multiplication, which naturally truncates.
- Lines 35-36: Casting to uint64_t for uint32_t addition is unnecessary since uint32_t addition naturally wraps on overflow (which is the intended behavior for Philox).

Apply this diff to simplify the code:

```diff
-  c1_val = (uint32_t)((uint64_t)B * (uint64_t)_c2);
-  c3_val = (uint32_t)((uint64_t)A * (uint64_t)_c0);
-  k0_val = (uint32_t)((uint64_t)k0_val + PHILOX_KEY_A);
-  k1_val = (uint32_t)((uint64_t)k1_val + PHILOX_KEY_B);
+  c1_val = B * _c2;
+  c3_val = A * _c0;
+  k0_val += PHILOX_KEY_A;
+  k1_val += PHILOX_KEY_B;
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- src/op/builtin.cc (1 hunks)
- src/op/builtin.h (1 hunks)
- src/target/codegen_cuda.cc (2 hunks)
- src/target/codegen_cuda.h (1 hunks)
- src/tl_templates/cuda/random.h (1 hunks)
- tilelang/language/random.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-03T06:24:11.411Z
Learnt from: Rachmanino
Repo: tile-ai/tilelang PR: 1175
File: src/op/math.cc:44-52
Timestamp: 2025-11-03T06:24:11.411Z
Learning: In tilelang's `src/op/math.cc`, the `infinity_op` function uses `std::numeric_limits<float>::infinity()` as a placeholder for all float types (including float64 and bfloat16). The codegen layer (PrintConst:Inf) handles the correct infinity value based on the dtype field of the FloatImm node, so the specific C++ template argument doesn't matter.
Applied to files:
src/target/codegen_cuda.cc
🧬 Code graph analysis (2)
src/target/codegen_cuda.cc (1)
- src/target/codegen_cuda.h (1): need_random_h_ (126-156)

tilelang/language/random.py (2)
- src/op/builtin.h (1): tvm (13-577)
- tilelang/language/tir/op.py (2): call_intrin (120-145), address_of (464-480)
🪛 Ruff (0.14.3)
tilelang/language/random.py
6-6: Unused function argument: dtype
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Quick Lint
🔇 Additional comments (2)
src/tl_templates/cuda/random.h (2)
7-11: LGTM: standard PTX multiply-high implementation. The inline PTX assembly correctly computes the high 32 bits of a 32×32→64 multiply, which is the standard approach for CUDA device code.

52-68: Kernel logic is correct, pending the float conversion fix. The kernel implementation correctly:
- Computes per-thread indices with bounds checking
- Initializes Philox state using seed and thread index as offset
- Calls the Philox round function
- Writes the converted output

However, output correctness depends on fixing the uint32_to_uniform_float_device function flagged above. Once that conversion is corrected, this kernel will produce properly distributed random floats.
```cpp
TL_DEVICE float uint32_to_uniform_float_device(uint32_t x) {
  const float scale = 4.6566127342e-10f;
  int32_t x_int32;
  memcpy(&x_int32, &x, sizeof(uint32_t));
  int32_t x_abs = (x_int32 < 0) ? (-x_int32 - 1) : x_int32;
  return (float)x_abs * scale;
}
```
Fix incorrect uint32-to-float conversion logic.

The current implementation has a critical logic error that produces incorrect and biased random distributions:
- It reinterprets uint32_t as int32_t and takes an absolute value, which is incorrect for values >= 2^31.
- The scale factor 4.6566127342e-10f ≈ 1/2^31 instead of the correct 1/2^32 ≈ 2.3283064365e-10f.
- This causes uint32 values in [2^31, 2^32) to wrap around and map back to [0, 1), creating significant bias.

For example:
- 0x80000000 → interpreted as -2147483648 → abs gives 2147483647 → ~1.0
- 0xFFFFFFFF → interpreted as -1 → abs gives 0 → ~0.0

The standard conversion for uint32 to uniform float [0, 1) is simply: x * 2.3283064365386963e-10f

Apply this diff to fix the conversion:

```diff
 TL_DEVICE float uint32_to_uniform_float_device(uint32_t x) {
-  const float scale = 4.6566127342e-10f;
-  int32_t x_int32;
-  memcpy(&x_int32, &x, sizeof(uint32_t));
-  int32_t x_abs = (x_int32 < 0) ? (-x_int32 - 1) : x_int32;
-  return (float)x_abs * scale;
+  return x * 2.3283064365386963e-10f;
 }
```
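The bias described here can be reproduced numerically with a small standalone sketch (the Python function names mirror the C++ ones only for illustration):

```python
import struct

# The signed-abs variant folds the top half of the uint32 range back onto
# [0, 1); direct scaling by 1/2^32 stays monotonic over the whole range.
def signed_abs_convert(x):
    # Mirrors the flagged implementation
    x_int32 = struct.unpack("<i", struct.pack("<I", x))[0]
    x_abs = -x_int32 - 1 if x_int32 < 0 else x_int32
    return x_abs * 4.6566127342e-10      # ~1/2^31

def direct_convert(x):
    # The suggested fix
    return x * 2.3283064365386963e-10    # exactly 1/2^32

print(signed_abs_convert(0x80000000))  # ~1.0 (wraps back to the top)
print(signed_abs_convert(0xFFFFFFFF))  # 0.0 (collides with x = 0)
print(direct_convert(0x80000000))      # 0.5
```

Because half of all uint32 inputs hit the folded branch, the bias would be visible in any histogram comparison against the Triton reference.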
```python
T.call_intrin(
    "handle",
    tir.op.Op.get("tl.philox_rand"),
    T.address_of(buffer[0]),
    total_elems,
    seed,
    n_rounds,
)
```
Wrap the intrinsic in T.evaluate.

As written, T.call_intrin just builds a PrimExpr. Without T.evaluate(...), the call never gets emitted into the TIR, so the random kernel is never invoked.

Apply this diff:

```diff
-T.call_intrin(
-    "handle",
-    tir.op.Op.get("tl.philox_rand"),
-    T.address_of(buffer[0]),
-    total_elems,
-    seed,
-    n_rounds,
-)
+T.evaluate(
+    T.call_intrin(
+        "handle",
+        tir.op.Op.get("tl.philox_rand"),
+        T.address_of(buffer[0]),
+        total_elems,
+        seed,
+        n_rounds,
+    )
+)
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| T.call_intrin( | |
| "handle", | |
| tir.op.Op.get("tl.philox_rand"), | |
| T.address_of(buffer[0]), | |
| total_elems, | |
| seed, | |
| n_rounds, | |
| ) | |
| T.evaluate( | |
| T.call_intrin( | |
| "handle", | |
| tir.op.Op.get("tl.philox_rand"), | |
| T.address_of(buffer[0]), | |
| total_elems, | |
| seed, | |
| n_rounds, | |
| ) | |
| ) |
tilelang/language/random.py (outdated)

```python
T.call_intrin(
    "handle",
    tir.op.Op.get("tl.philox_rand"),
    T.address_of(buffer[0]),
```
Index all buffer dimensions when taking the address.

buffer[0] only works for 1D buffers; multi-dimensional buffers require one index per dimension, otherwise script lowering fails. Emit a zero tuple so the API works for general shapes.

Apply this diff after the T.evaluate change:

```diff
-    T.address_of(buffer[0]),
+    T.address_of(buffer[tuple(0 for _ in buffer.shape)]),
```
- but 2D has some problems
Actionable comments posted: 2
♻️ Duplicate comments (2)
testing/python/language/test_tilelang_language_rand.py (2)
23-28: Address the unused parameter and mask logic issues identified in the previous review. As noted previously, this function has two issues:
- The dtype parameter is declared but never used
- The mask logic offset < M is misleading for multi-block scenarios

Please refer to the previous review comment for the suggested fix.

31-40: Fix dtype inconsistency and address issues from the previous review. In addition to the issues identified previously (unused tensor allocation, incorrect tl.constexpr usage), there is a dtype inconsistency:
- Line 33 explicitly uses torch.float32 for the TileLang result (though it is immediately reassigned)
- Line 36 uses getattr(torch, dtype) for the Triton result

While line 34 reassigns tilelang_result, this inconsistency could be confusing. Ensure both implementations use the same dtype.

Apply this diff to address all issues:

```diff
 def run_rand_1d(M=1024, seed=42, dtype="float32", device="cuda"):
     tilelang_kernel = tilelang_rand_1d(M=M, seed=seed, dtype=dtype)
-    tilelang_result = torch.empty(M, dtype=torch.float32, device=device)
     tilelang_result = tilelang_kernel()
     triton_result = torch.empty(M, dtype=getattr(torch, dtype), device=device)
     grid = (1,)
-    BLOCK = tl.constexpr(M)
-    triton_rand_1d[grid](triton_result, BLOCK, seed=seed, dtype=getattr(tl, dtype))
+    triton_rand_1d[grid](triton_result, M, seed=seed, TOTAL_SIZE=M)
     torch.testing.assert_close(tilelang_result, triton_result, atol=1e-3, rtol=1e-3)
```
🧹 Nitpick comments (1)

testing/python/language/test_tilelang_language_rand.py (1)

14-18: Consider simplifying the indexing for the single-block kernel. Since the kernel uses only 1 block, bx will always be 0, making the expression bx * M + i equivalent to just i. For clarity, you can simplify the indexing:

```diff
 @T.prim_func
 def rand_kernel(A: T.Tensor((M,), dtype)):
     with T.Kernel(1, threads=1024) as bx:
         rand_buffer = T.alloc_fragment((M,), dtype)
         T.rand(rand_buffer, seed)
         for i in T.Parallel(M):
-            A[bx * M + i] = rand_buffer[i]
+            A[i] = rand_buffer[i]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/tl_templates/cuda/random.h (1 hunks)
- testing/python/language/test_tilelang_language_rand.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/tl_templates/cuda/random.h
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_rand.py (6)
- tilelang/language/allocate.py (2): alloc_fragment (59-70), alloc_shared (27-42)
- tilelang/language/random.py (1): rand (17-22)
- tilelang/language/parallel.py (1): Parallel (9-29)
- tilelang/transform/pass_config.py (1): PassConfigKey (6-144)
- tilelang/language/copy.py (1): copy (11-87)
- tilelang/language/tir/op.py (1): if_then_else (3014-3044)
🪛 Ruff (0.14.3)
testing/python/language/test_tilelang_language_rand.py
24-24: Unused function argument: dtype
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
- GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
🔇 Additional comments (2)
testing/python/language/test_tilelang_language_rand.py (2)

96-109: Note: different block sizes between the Triton and TileLang implementations. The Triton kernel uses BLOCK_SIZE=1024 (line 104), while the TileLang kernel uses block_m=64 (line 71 in tilelang_seeded_dropout). While this should not affect correctness if both implementations are correct, it is worth noting that they process data at different granularities.

Additionally, this test will likely fail due to the seed-handling issue identified in the tilelang_seeded_dropout function, where all blocks use the same seed.

112-119: LGTM: the test structure is appropriate. The test function is properly decorated with @tilelang.testing.requires_cuda and calls both test runners. The main guard follows standard Python conventions.

Note that the issues identified in earlier comments will need to be resolved before the test can pass reliably.
```python
output_buffer = T.alloc_fragment((block_m,), dtype)

T.copy(A[block_start], A_shared)
T.rand(rand_buffer, seed)
```
🧩 Analysis chain
Critical: Same seed used across all blocks produces identical random sequences.
Line 82 calls T.rand(rand_buffer, seed) with the same seed in every block. Unlike the Triton reference (line 58), which passes element-specific offsets to ensure unique random numbers, this implementation will generate the same random sequence in each block. This breaks the dropout functionality, as elements in different blocks at the same local offset will have identical dropout masks.
Verify whether the TileLang rand implementation automatically accounts for block/thread IDs to generate unique sequences:
🏁 Script executed:

```shell
#!/bin/bash
# Check if _rand_parallel_impl uses block/thread context for offset
# Search for the implementation of _rand_parallel_impl
ast-grep --pattern 'def _rand_parallel_impl($$$)'
# Also search for how seed/offset is handled in random.py
rg -n -A 10 '_rand_parallel_impl'
```

Length of output: 1226
---
🏁 Script executed:

```shell
#!/bin/bash
# Find the definition of T.rand function
# Search in language module for rand function
rg -n 'def rand' --type=py
# Also check if it's an alias or imported
rg -n 'rand\s*=' --type=py -A 2 -B 2
# Search for rand_buffer allocation in test file
rg -n 'rand_buffer' --type=py -B 5 -A 5
```

Length of output: 3787
🏁 Script executed:
#!/bin/bash
# Also search more broadly for rand in the tilelang module
rg -n '\bT\.rand' --type=py -C 3

Length of output: 1327
---

🏁 Script executed:

```shell
#!/bin/bash
# Get the full definition of T.rand including docstring
rg -n 'def rand' --type=py -A 15 tilelang/language/random.py
```

Length of output: 209
Critical: Same seed used across all blocks produces identical random sequences.
Line 82 calls T.rand(rand_buffer, seed) with the same seed in every block. The T.rand function has no offset parameter to differentiate between blocks. Unlike the Triton reference (lines 23-28), which passes element-specific offsets to ensure unique random numbers, all blocks will generate identical random sequences. This breaks dropout functionality, as elements in different blocks at the same local offset will have identical dropout masks.
To fix: Modify T.rand to accept an offset parameter (like Triton), or compute different seeds/offsets per block using block IDs before calling T.rand.
🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_rand.py around line 82,
calling T.rand(rand_buffer, seed) with the same seed in every block produces
identical random sequences across blocks; change the call so each block uses a
distinct offset/seed: either update T.rand to accept an additional offset
parameter (like Triton) and pass a per-element/per-block offset computed from
block IDs, or compute a unique seed per block (e.g., seed + block_id or seed ^
block_id) before calling T.rand; ensure the offset/seed computation uses the
thread/block indices available in the kernel so different blocks generate
independent random streams.
It seems there are still some issues when the buffer is 2D...
Actionable comments posted: 3
♻️ Duplicate comments (4)
tilelang/language/random.py (1)
13-21: Emit the intrinsic and supply a writable pointer.

Line 13: Without wrapping the `T.call_intrin` in `T.evaluate`, the call is dropped during lowering, so the Philox generator never runs. Line 16: `buffer.access_ptr("r")` returns a read-only pointer, which cannot legally be passed to a `float*` parameter and will fail compilation once CUDA codegen instantiates the call. Please wrap the intrinsic in `T.evaluate` and request a read/write pointer before passing it to the device helper.

```diff
-    T.call_intrin(
-        "handle",
-        tir.op.Op.get("tl.philox_rand"),
-        buffer.access_ptr("r"),
-        total_elems,
-        block_m,
-        block_n,
-        seed,
-        n_rounds,
-    )
+    T.evaluate(
+        T.call_intrin(
+            "handle",
+            tir.op.Op.get("tl.philox_rand"),
+            buffer.access_ptr("rw"),
+            total_elems,
+            block_m,
+            block_n,
+            seed,
+            n_rounds,
+        )
+    )
```

src/tl_templates/cuda/random.h (1)
44-49: Correct the uint32→float conversion.

Line 44: Reinterpreting the value as signed, taking an absolute value, and scaling by 1/2³¹ both introduces UB (on `INT32_MIN`) and produces a biased distribution. The canonical mapping to `[0, 1)` is simply `x * 2.3283064365386963e-10f`.

```diff
 TL_DEVICE float uint32_to_uniform_float_device(uint32_t x) {
-  const float scale = 4.6566127342e-10f;
-  int32_t x_int32;
-  memcpy(&x_int32, &x, sizeof(uint32_t));
-  int32_t x_abs = (x_int32 < 0) ? (-x_int32 - 1) : x_int32;
-  return (float)x_abs * scale;
+  return x * 2.3283064365386963e-10f;
 }
```

testing/python/language/test_tilelang_language_rand.py (2)
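As a sanity check of the proposed constant, the same mapping can be exercised from plain Python (a sketch of the math only, not the device code):

```python
SCALE = 2.3283064365386963e-10  # exactly 2**-32 as a double

def uint32_to_uniform(x: int) -> float:
    # Canonical counter-output -> [0, 1) mapping: no signed
    # reinterpretation, so no UB for the INT32_MIN bit pattern.
    return x * SCALE

assert uint32_to_uniform(0) == 0.0
assert uint32_to_uniform(2**31) == 0.5            # midpoint maps exactly
assert 0.0 <= uint32_to_uniform(2**32 - 1) < 1.0  # max value stays below 1
```

Every 32-bit input lands in `[0, 1)` with uniform spacing, which is what the dropout comparison `rand > p` expects.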
24-28: Fix Triton mask and drop the unused dtype parameter.

Lines 24-28: Using `mask=offset < M` only covers the first block; later blocks never store anything. The `dtype` kernel argument is also unused. Please replace the mask with a check against the overall element count and remove the dead parameter.

```diff
-@triton.jit
-def triton_rand_1d(X, M, seed, dtype: tl.constexpr):
+@triton.jit
+def triton_rand_1d(X, M, seed, total_size):
     pid = tl.program_id(0)
     offset = pid * M + tl.arange(0, M)
     rand = tl.rand(seed, offset)
-    tl.store(X + offset, rand, mask=offset < M)
+    tl.store(X + offset, rand, mask=offset < total_size)
```
31-39: Update the Triton call site to pass the total size.

Lines 31-39: After fixing the Triton kernel signature, the launch still passes the old arguments. Pass the total element count and drop the redundant `tl.constexpr` wrapper.

```diff
-    triton_result = torch.empty(M, dtype=getattr(torch, dtype), device=device)
-    grid = (1,)
-    BLOCK = tl.constexpr(M)
-    triton_rand_1d[grid](triton_result, BLOCK, seed=seed, dtype=getattr(tl, dtype))
+    triton_result = torch.empty(M, dtype=getattr(torch, dtype), device=device)
+    grid = (1,)
+    triton_rand_1d[grid](triton_result, M, seed=seed, total_size=M)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/tl_templates/cuda/random.h(1 hunks)testing/python/language/test_tilelang_language_rand.py(1 hunks)tilelang/language/random.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
testing/python/language/test_tilelang_language_rand.py (4)

- src/tl_templates/cuda/random.h (1): `tl` (5-86)
- tilelang/language/allocate.py (2): `alloc_fragment` (59-70), `alloc_shared` (27-42)
- tilelang/language/random.py (1): `rand` (25-40)
- tilelang/language/tir/op.py (1): `if_then_else` (3014-3044)
tilelang/language/random.py (2)

- src/op/builtin.h (1): `tvm` (13-577)
- tilelang/language/tir/op.py (1): `call_intrin` (120-145)
🪛 Ruff (0.14.3)
testing/python/language/test_tilelang_language_rand.py

- 24-24: Unused function argument: `dtype` (ARG001)
- 114-114: Unused function argument: `dtype` (ARG001)

tilelang/language/random.py

- 12-12: Unused function argument: `dtype` (ARG001)
- 37-38: Avoid specifying long messages outside the exception class (TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
```cuda
int elems_per_thread = (total_elems + num_threads - 1) / num_threads;

for (int i = 0; i < elems_per_thread; i++) {
  int local_linear_idx = tid * elems_per_thread + i;
  int local_row = local_linear_idx / block_n;
  int local_col = local_linear_idx % block_n;

  int global_row = block_row_offset + local_row;
  int global_col = block_col_offset + local_col;
  uint64_t global_idx = (uint64_t)global_row * N + global_col;

  uint32_t c0 = (uint32_t)(global_idx & 0xFFFFFFFFULL);
  uint32_t c1 = (uint32_t)((global_idx >> 32) & 0xFFFFFFFFULL);
  uint32_t c2 = 0U;
  uint32_t c3 = 0U;

  philox_impl_device(&c0, &c1, &c2, &c3, seed_lo, seed_hi, n_rounds);
  output[i] = uint32_to_uniform_float_device(c0);
}
```
Fix overlapping writes and guard the tail.
Lines 65-83: Every thread writes to output[i], so threads clobber each other’s results. When total_elems < blockDim.x * elems_per_thread, local_linear_idx also exceeds the valid range, causing out-of-bounds writes. Guard the loop with local_linear_idx < total_elems and index the buffer with local_linear_idx.
```diff
 for (int i = 0; i < elems_per_thread; i++) {
   int local_linear_idx = tid * elems_per_thread + i;
+  if (local_linear_idx >= total_elems) {
+    break;
+  }
   int local_row = local_linear_idx / block_n;
   int local_col = local_linear_idx % block_n;
   int global_row = block_row_offset + local_row;
   int global_col = block_col_offset + local_col;
   uint64_t global_idx = (uint64_t)global_row * N + global_col;
   uint32_t c0 = (uint32_t)(global_idx & 0xFFFFFFFFULL);
   uint32_t c1 = (uint32_t)((global_idx >> 32) & 0xFFFFFFFFULL);
   uint32_t c2 = 0U;
   uint32_t c3 = 0U;
   philox_impl_device(&c0, &c1, &c2, &c3, seed_lo, seed_hi, n_rounds);
-  output[i] = uint32_to_uniform_float_device(c0);
+  output[local_linear_idx] = uint32_to_uniform_float_device(c0);
 }
```

🤖 Prompt for AI Agents
In src/tl_templates/cuda/random.h around lines 65 to 83, threads currently write
to output[i] unguarded which causes overlapping writes and potential
out-of-bounds when local_linear_idx >= total_elems; modify the loop to check
that local_linear_idx < total_elems before doing work and write into
output[local_linear_idx] (not output[i]) so each thread writes to the correct
global slot and the tail is safely guarded.
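The ownership scheme this guard enforces can be modeled in Python (hypothetical `owned_indices` helper mirroring the CUDA indexing above):

```python
def owned_indices(tid: int, num_threads: int, total_elems: int) -> list[int]:
    # Each thread owns a contiguous chunk of ceil(total/num_threads)
    # indices; the bound check keeps the last threads from stepping
    # past total_elems when the division is uneven.
    per_thread = (total_elems + num_threads - 1) // num_threads
    start = tid * per_thread
    return [i for i in range(start, start + per_thread) if i < total_elems]

# 10 elements over 4 threads: every index covered exactly once, no overflow.
cover = sorted(i for t in range(4) for i in owned_indices(t, 4, 10))
assert cover == list(range(10))
```

Writing to `output[local_linear_idx]` rather than `output[i]` is what makes these per-thread slices land in disjoint regions of the buffer.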
```python
T.copy(A[block_start], A_shared)
T.rand(rand_buffer, seed)

for i in T.Parallel(block_m):
    output_buffer[i] = T.if_then_else(
        rand_buffer[i] > p,
        A_shared[i] / (1.0 - p),
        0.0,
    )

T.copy(output_buffer, B[block_start])
```
Guard loads/stores in the last dropout block.
Lines 81-91: T.copy(A[block_start], A_shared) and T.copy(output_buffer, B[block_start]) always move block_m elements. When M % block_m != 0, the final block reads and writes past the end of A/B, leading to undefined behaviour. Please compute the remaining element count (e.g. valid = T.min(block_m, M - block_start)) and guard both copies and the parallel loop so the tail block only touches valid indices.
🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_rand.py around lines 81 to 91,
the copies and the parallel loop always move block_m elements which overruns A/B
when M % block_m != 0; compute valid = T.min(block_m, M - block_start) and use
that to bound all tail operations: copy only valid elements from A to A_shared
and from output_buffer back to B, and change the T.Parallel loop to iterate over
valid instead of block_m so rand_buffer/output_buffer and the conditional write
only touch indices < valid.
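The suggested tail computation can be sketched in Python (hypothetical `tail_extent` helper; the real fix would use `T.min` inside the kernel):

```python
def tail_extent(block_start: int, block_m: int, M: int) -> int:
    # Number of valid elements in a (possibly partial) last block:
    # full blocks get block_m, the tail block gets the remainder.
    return min(block_m, M - block_start)

# M = 10, block_m = 4: blocks start at 0, 4, 8; the last holds only 2.
assert [tail_extent(s, 4, 10) for s in (0, 4, 8)] == [4, 4, 2]
```

Both `T.copy` calls and the `T.Parallel` loop would then be bounded by this extent so the tail block never touches indices past `M`.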
```python
with T.Kernel(T.ceildiv(M, block_m), T.ceildiv(N, block_n), threads=64) as (bx, by):
    rand_buffer = T.alloc_fragment((block_m, block_n), dtype)
    T.rand(rand_buffer, seed)
    T.copy(rand_buffer, A[bx * block_m, by * block_n])
```
Handle partial tiles in the 2D generator.
Lines 138-142: T.copy(rand_buffer, A[bx * block_m, by * block_n]) blindly writes block_m * block_n elements. For matrices whose dimensions aren’t multiples of block_m/block_n, the last tiles run out of bounds. Add per-dimension guards (similar to the Triton mask) or clamp the copy extents to the remaining rows/columns before issuing the copy.
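The per-dimension clamping can be sketched as follows (hypothetical `tile_extents` helper; the kernel version would use `T.min` with the block indices):

```python
def tile_extents(bx: int, by: int, block_m: int, block_n: int,
                 M: int, N: int) -> tuple[int, int]:
    # Clamp the copy extents of tile (bx, by) so the trailing tiles
    # in each dimension stay within the M x N matrix.
    rows = min(block_m, M - bx * block_m)
    cols = min(block_n, N - by * block_n)
    return rows, cols

# 10x6 matrix tiled 4x4: the bottom-right tile is only 2x2.
assert tile_extents(2, 1, 4, 4, 10, 6) == (2, 2)
```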
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```cpp
TIR_DEFINE_TL_BUILTIN(philox_rand)
    .set_num_inputs(4)
    .set_attr<TCallEffectKind>("TCallEffectKind",
```
Register philox_rand with correct input count
The new intrinsic is registered with set_num_inputs(4) even though every caller passes six arguments (buffer pointer, total_elems, block_m, block_n, seed, n_rounds) and the CUDA helper tl::philox_rand also takes six parameters. TVM’s intrinsic registry enforces num_inputs, so this mismatch will trigger a check failure when T.rand is lowered, making the feature unusable. The intrinsic should accept six inputs (or -1 for variadic) to match the call sites.
Useful? React with 👍 / 👎.
src/tl_templates/cuda/random.h
Outdated
```cuda
int elems_per_thread = (total_elems + num_threads - 1) / num_threads;

for (int i = 0; i < elems_per_thread; i++) {
  int local_linear_idx = tid * elems_per_thread + i;
  int local_row = local_linear_idx / block_n;
  int local_col = local_linear_idx % block_n;

  int global_row = block_row_offset + local_row;
  int global_col = block_col_offset + local_col;
  uint64_t global_idx = (uint64_t)global_row * N + global_col;

  uint32_t c0 = (uint32_t)(global_idx & 0xFFFFFFFFULL);
  uint32_t c1 = (uint32_t)((global_idx >> 32) & 0xFFFFFFFFULL);
  uint32_t c2 = 0U;
  uint32_t c3 = 0U;

  philox_impl_device(&c0, &c1, &c2, &c3, seed_lo, seed_hi, n_rounds);
  output[i] = uint32_to_uniform_float_device(c0);
```
Store Philox results into unique buffer indices
Inside the CUDA helper philox_rand each thread computes local_linear_idx for the element it owns, but the generated random value is written to output[i] instead of output[local_linear_idx]. When more than one thread participates this causes every thread to overwrite the same first few positions, leaving most of the buffer uninitialized and yielding repeated numbers when copied back to global memory. It can also write past the end of the buffer when total_elems is not a multiple of the thread count. The output pointer should be indexed by local_linear_idx (with a bounds check) so each thread writes to its own slice.
Useful? React with 👍 / 👎.
cc @Cunxiao2002
Thanks! I will take a look.
Summary by CodeRabbit
New Features
Tests