Conversation

@Cunxiao2002 (Contributor) commented Nov 5, 2025

Summary by CodeRabbit

  • New Features

    • Added random number generation support for tensor buffers, enabling 1D and 2D buffer population with Philox-based random values.
  • Tests

    • Added comprehensive test suite validating random number generation across multiple implementations and configurations with numerical accuracy checks.

@coderabbitai bot (Contributor) commented Nov 5, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The PR introduces Philox-based random number generation support to TileLang, spanning Python API layer, C++ intrinsic registration, CUDA codegen integration, and device-side implementation, alongside comprehensive 1D and 2D test cases validating cross-implementation consistency.

Changes

  • Python API & Tests (tilelang/language/__init__.py, tilelang/language/random.py):
    Added a public rand() function for Philox RNG with shape validation and an internal _rand_parallel_impl() macro wrapper; re-exported rand at the package level.
  • Test Suite (testing/python/language/test_tilelang_language_rand.py):
    New test module with 1D/2D RNG kernels, cross-validation against Triton and Torch, a seeded dropout path, and CUDA-gated test execution.
  • C++ Builtin Infrastructure (src/op/builtin.h, src/op/builtin.cc):
    Added the philox_rand() intrinsic declaration and TL builtin registration with a 6-input signature and an opaque call effect.
  • CUDA Codegen (src/target/codegen_cuda.h, src/target/codegen_cuda.cc):
    Introduced a need_random_h_ flag; integrated philox_rand intrinsic handling into CallNode emission and the Finish phase, including conditional emission of the random.h header.
  • CUDA Device Implementation (src/tl_templates/cuda/random.h):
    Added the Philox RNG kernel (philox_rand) and helper functions (umulhi_uint32, philox_impl_device, uint32_to_uniform_float_device), with per-thread work distribution and bounds checking.
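
For orientation, here is a minimal sketch of the new API in use, adapted from the 1D test kernel quoted later in this thread (illustrative only; the authoritative signatures are in the diffs below):

import tilelang.language as T

M = 1024

@T.prim_func
def rand_kernel(A: T.Tensor((M,), "float32")):
    # A single block fills a fragment with Philox-generated uniforms,
    # then copies the values out to global memory.
    with T.Kernel(1, threads=1024) as bx:
        rand_buffer = T.alloc_fragment((M,), "float32")
        T.rand(rand_buffer, 42)  # seed = 42
        for i in T.Parallel(M):
            A[i] = rand_buffer[i]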

Sequence Diagram

sequenceDiagram
    participant PyUser as Python User
    participant PyAPI as tilelang.language.rand()
    participant Macro as _rand_parallel_impl()
    participant TVMCall as T.call_intrin(philox_rand)
    participant Codegen as CUDA Codegen
    participant DevKernel as Device Kernel (philox_rand)
    
    PyUser->>PyAPI: rand(buffer, seed, n_rounds)
    activate PyAPI
    PyAPI->>PyAPI: Validate shape (1D or 2D)
    PyAPI->>PyAPI: Compute total_elems, block_m, block_n
    PyAPI->>Macro: _rand_parallel_impl(buffer, seed, ...)
    deactivate PyAPI
    
    activate Macro
    Macro->>TVMCall: T.call_intrin(philox_rand, ...)
    deactivate Macro
    
    TVMCall->>Codegen: Emit CUDA kernel call
    activate Codegen
    Codegen->>Codegen: Set need_random_h_ flag
    Codegen->>Codegen: Emit extern call to tl::philox_rand
    Codegen->>Codegen: Include random.h in Finish
    deactivate Codegen
    
    Codegen->>DevKernel: Execute on CUDA device
    activate DevKernel
    DevKernel->>DevKernel: Initialize Philox state from seed
    DevKernel->>DevKernel: Run philox_impl_device (n_rounds)
    DevKernel->>DevKernel: Convert uint32 → uniform float
    DevKernel->>DevKernel: Write to output buffer
    deactivate DevKernel

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • src/tl_templates/cuda/random.h: Device-side Philox implementation with bit-manipulation logic (umulhi_uint32, state transforms) requires verification of statistical soundness and numerical correctness.
  • tilelang/language/random.py: Shape validation logic and parameter derivation (block_m, block_n calculation) merit careful review.
  • src/target/codegen_cuda.cc/h: Integration points with existing codegen patterns; verify flag lifecycle and header inclusion correctness.
  • testing/python/language/test_tilelang_language_rand.py: Cross-implementation validation logic (1D, 2D, seeded dropout) and tolerance thresholds (atol/rtol ~1e-3) warrant scrutiny for floating-point equivalence claims.

Poem

🐰 A Philox of seeds we now shall sow,
With random bits in CUDA's glow,
Across the tensors, fast and free,
From one-dee to two-dee spree,
Round and round the cryptos go! 🎲

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 21.74%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title directly describes the main change: adding random number generation support. It is concise, clear, and accurately reflects the primary purpose of the pull request.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ebd64c3 and e381855.

📒 Files selected for processing (3)
  • src/op/builtin.cc (1 hunks)
  • src/tl_templates/cuda/random.h (1 hunks)
  • testing/python/language/test_tilelang_language_rand.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_rand.py (6)
src/tl_templates/cuda/random.h (1)
  • tl (5-93)
tilelang/env.py (1)
  • disable_cache (271-272)
tilelang/jit/kernel.py (1)
  • out_idx (461-462)
tilelang/language/allocate.py (1)
  • alloc_fragment (59-70)
tilelang/language/random.py (1)
  • rand (25-40)
tilelang/language/parallel.py (1)
  • Parallel (9-29)
🪛 Ruff (0.14.3)
testing/python/language/test_tilelang_language_rand.py

27-27: Unused function argument: dtype

(ARG001)


50-50: Unused function argument: dtype

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)


@github-actions bot commented Nov 5, 2025

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@Cunxiao2002 marked this pull request as draft November 5, 2025 12:59
@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
testing/python/language/test_rand.py (1)

9-21: Simplify indexing since only one block is used.

Line 19 uses bx * M + i, but since the kernel is launched with only 1 block (line 14), bx is always 0. The multiplication is unnecessary and potentially misleading.

Apply this diff to simplify the indexing:

            for i in T.Parallel(M):
-                A[bx * M + i] = rand_buffer[i]
+                A[i] = rand_buffer[i]
tilelang/language/random.py (2)

39-47: Consider simplifying the uint32-to-float conversion.

The current implementation reinterprets as signed int32 and handles negative values, which works but is more complex than necessary. A simpler approach would be to convert the uint32 directly to float and scale by 1/2^32.

Alternative implementation:

 @T.macro
 def uint32_to_uniform_float(x: tir.PrimExpr) -> tir.PrimExpr:
-    assert x.dtype == 'uint32' or x.dtype == "int32", f"x.dtype {x.dtype} is not supported"
-    x_int32 = T.reinterpret('int32', x)
-    scale = tir.const(4.6566127342e-10, "float32")
-
-    x_abs = T.if_then_else(x_int32 < 0, -x_int32 - 1, x_int32)
-
-    return T.Cast("float32", x_abs) * scale
+    assert x.dtype == 'uint32', f"x.dtype {x.dtype} is not supported"
+    # Scale factor: 1 / 2^32 = 2.3283064365e-10
+    scale = tir.const(2.3283064365e-10, "float32")
+    return T.Cast("float32", x) * scale

This avoids the signed interpretation and conditional logic. If the current implementation has specific numerical properties you want to preserve, please document why.
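
As a quick host-side sanity check (a NumPy sketch, not part of this PR), the two mappings can be compared at the extremes of the uint32 range; note that the folded variant maps both halves of the range onto the same interval:

import numpy as np

x = np.array([0, 1, 2**31 - 1, 2**31, 2**32 - 1], dtype=np.uint32)

# Current macro's mapping: reinterpret as int32 and fold negatives.
xi = x.view(np.int32).astype(np.int64)
folded = np.where(xi < 0, -xi - 1, xi)
current = folded.astype(np.float32) * np.float32(4.6566127342e-10)  # just under 1/2^31

# Suggested mapping: scale the raw uint32 by ~1/2^32.
suggested = x.astype(np.float64) * 2.3283064365386963e-10

print(current)    # 0x7FFFFFFF and 0x80000000 both land near the top of [0, 1)
print(suggested)  # monotone in x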


50-62: Efficiency concern: Only 1 of 4 Philox outputs is used.

The Philox algorithm produces four independent random values (c0, c1, c2, c3) per invocation, but only c0 is used. This wastes 75% of the generated entropy and is computationally inefficient.

Consider optimizing by processing 4 elements per Philox call:

@T.macro
def _rand_parallel_impl(buffer: T.Buffer, seed_lo, seed_hi, total_elems, n_rounds):
    # Process 4 elements at a time
    for i in T.Parallel((total_elems + 3) // 4):
        offset = T.Cast("uint32", i)
        offset_lo = offset
        offset_hi = tir.const(0, "uint32")
        
        c0, c1, c2, c3 = philox_impl(offset_lo, offset_hi, tir.const(0, "uint32"),
                                     tir.const(0, "uint32"), seed_lo, seed_hi, n_rounds)
        
        # Store up to 4 values
        for j in T.serial(4):
            idx = i * 4 + j
            if idx < total_elems:
                coords = index_to_coordinates(idx, buffer.shape)
                rand_val = [c0, c1, c2, c3][j]
                buffer[coords] = uint32_to_uniform_float(rand_val)

This would reduce the number of Philox calls by 4×, significantly improving performance for large buffers.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a9d823b and a8d70a2.

📒 Files selected for processing (3)
  • testing/python/language/test_rand.py (1 hunks)
  • tilelang/language/__init__.py (1 hunks)
  • tilelang/language/random.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tilelang/language/__init__.py (1)
tilelang/language/random.py (1)
  • rand (65-74)
testing/python/language/test_rand.py (5)
tilelang/jit/kernel.py (1)
  • out_idx (461-462)
tilelang/language/proxy.py (1)
  • Tensor (252-253)
tilelang/language/allocate.py (1)
  • alloc_fragment (59-70)
tilelang/language/random.py (1)
  • rand (65-74)
tilelang/language/parallel.py (1)
  • Parallel (9-29)
tilelang/language/random.py (3)
tilelang/language/utils.py (1)
  • index_to_coordinates (91-110)
tilelang/language/tir/op.py (2)
  • reinterpret (2006-2025)
  • if_then_else (3014-3044)
tilelang/language/parallel.py (1)
  • Parallel (9-29)
🪛 Ruff (0.14.3)
tilelang/language/__init__.py

95-95: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

testing/python/language/test_rand.py

25-25: Unused function argument: dtype

(ARG001)

tilelang/language/random.py

58-58: Unpacked variable c1 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


58-58: Unpacked variable c2 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


58-58: Unpacked variable c3 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🔇 Additional comments (5)
tilelang/language/__init__.py (1)

95-96: LGTM! Clean re-export of the new rand API.

The import follows the established pattern in this file for exposing module-level APIs. The noqa: F401 directive is appropriate for re-exported symbols.

testing/python/language/test_rand.py (1)

45-47: LGTM!

The test function is properly guarded with @tilelang.testing.requires_cuda and correctly invokes the test harness.

tilelang/language/random.py (3)

6-8: LGTM! Correct implementation of high-bits multiplication.

The function correctly computes the high 32 bits of a 64-bit product using widening multiplication and bit shifting.
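
For reference, the same operation is easy to emulate on the host with Python's arbitrary-precision integers (a hypothetical cross-checking helper, not part of the PR):

def umulhi_uint32(a: int, b: int) -> int:
    # High 32 bits of the 32x32 -> 64-bit unsigned product.
    return ((a & 0xFFFFFFFF) * (b & 0xFFFFFFFF)) >> 32

assert umulhi_uint32(0xFFFFFFFF, 0xFFFFFFFF) == 0xFFFFFFFE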


11-36: LGTM! Correct Philox-4x32 implementation.

The implementation follows the standard Philox-4x32 algorithm with correct constants and round functions. The key evolution and counter mixing are properly implemented.
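
For anyone cross-checking outputs, here is a pure-Python reference of the Philox-4x32 round structure (a sketch using the standard Philox constants, following Triton's formulation rather than quoting this PR's code):

MASK32 = 0xFFFFFFFF
PHILOX_ROUND_A, PHILOX_ROUND_B = 0xD2511F53, 0xCD9E8D57
PHILOX_KEY_A, PHILOX_KEY_B = 0x9E3779B9, 0xBB67AE85

def philox_4x32(c0, c1, c2, c3, k0, k1, n_rounds=10):
    for _ in range(n_rounds):
        _c0, _c2 = c0, c2
        # Two widening multiplies per round; keys and counters are mixed in via XOR.
        c0 = (((PHILOX_ROUND_B * _c2) >> 32) ^ c1 ^ k0) & MASK32
        c2 = (((PHILOX_ROUND_A * _c0) >> 32) ^ c3 ^ k1) & MASK32
        c1 = (PHILOX_ROUND_B * _c2) & MASK32
        c3 = (PHILOX_ROUND_A * _c0) & MASK32
        # Advance the key (Weyl sequence) between rounds.
        k0 = (k0 + PHILOX_KEY_A) & MASK32
        k1 = (k1 + PHILOX_KEY_B) & MASK32
    return c0, c1, c2, c3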


65-74: LGTM! Clean public API implementation.

The function correctly:

  • Converts the seed to uint64 and splits it into 32-bit components
  • Calculates the total number of elements from the buffer shape
  • Delegates to the parallel implementation

Comment on lines +24 to +29
@triton.jit
def triton_rand_1d(X, M, seed, dtype: tl.constexpr):
pid = tl.program_id(0)
offset = pid * M + tl.arange(0, M)
rand = tl.rand(seed, offset)
tl.store(X + offset, rand, mask=offset < M)

⚠️ Potential issue | 🟠 Major

Remove unused parameter and fix mask logic.

Two issues:

  1. The dtype parameter is declared but never used in the function body.
  2. The mask offset < M is incorrect. Since offset = pid * M + tl.arange(0, M), the values range from pid * M to pid * M + M - 1, which are always >= M when pid > 0. The mask should check against a total size or be removed if the grid guarantees coverage.

Apply this diff:

 @triton.jit
-def triton_rand_1d(X, M, seed, dtype: tl.constexpr):
+def triton_rand_1d(X, M, seed, TOTAL_SIZE):
     pid = tl.program_id(0)
     offset = pid * M + tl.arange(0, M)
     rand = tl.rand(seed, offset)
-    tl.store(X + offset, rand, mask=offset < M)
+    tl.store(X + offset, rand, mask=offset < TOTAL_SIZE)

Note: If the grid is always guaranteed to match the total size exactly, the mask can be omitted entirely.

🧰 Tools
🪛 Ruff (0.14.3)

25-25: Unused function argument: dtype

(ARG001)

🤖 Prompt for AI Agents
In testing/python/language/test_rand.py around lines 24 to 29, the
triton_rand_1d function declares an unused dtype parameter and uses an incorrect
mask; remove the unused dtype parameter from the signature and any callers, and
fix the mask by either (a) replacing "offset < M" with a check against the total
number of elements (e.g., "offset < total_size") after adding a total_size
argument, or (b) if the kernel grid is guaranteed to exactly cover the array,
remove the mask entirely; update any call sites accordingly to pass total_size
if you choose option (a).

@LeiWang1999 marked this pull request as ready for review November 6, 2025 09:40
@coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
testing/python/language/test_tilelang_language_rand.py (2)

24-29: Remove unused parameter and fix mask logic.

The issues flagged in the previous review remain:

  1. The dtype parameter is declared but never used
  2. The mask offset < M is incorrect when pid > 0 since offset ranges from pid * M to pid * M + M - 1

32-42: Fix Triton kernel invocation and remove unnecessary tensor allocation.

The issues flagged in the previous review remain:

  1. Line 34 creates tilelang_result unnecessarily before reassignment on line 35
  2. Line 39 uses tl.constexpr(M) incorrectly—constexpr is a type hint, not a runtime wrapper
  3. Line 40 passes unused dtype parameter to the kernel
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a8d70a2 and 760536a.

📒 Files selected for processing (2)
  • testing/python/language/test_tilelang_language_rand.py (1 hunks)
  • tilelang/language/random.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tilelang/language/random.py (2)
tilelang/language/__init__.py (1)
  • import_source (98-100)
tilelang/language/tir/op.py (2)
  • call_extern (173-195)
  • address_of (464-480)
testing/python/language/test_tilelang_language_rand.py (3)
tilelang/language/allocate.py (1)
  • alloc_fragment (59-70)
tilelang/language/random.py (1)
  • rand (92-97)
tilelang/language/parallel.py (1)
  • Parallel (9-29)
🪛 Ruff (0.14.3)
testing/python/language/test_tilelang_language_rand.py

25-25: Unused function argument: dtype

(ARG001)

🔇 Additional comments (6)
testing/python/language/test_tilelang_language_rand.py (3)

1-6: LGTM!

The imports are appropriate for testing TileLang and Triton random number generation.


45-51: LGTM!

The test structure is appropriate. The @tilelang.testing.requires_cuda decorator correctly guards CUDA-dependent tests.


14-15: Review comment appears speculative without definitive evidence of a problem.

The original comment raises reasonable architectural concerns about single-block kernels and fragment buffer sizes, but the search results do not reveal explicit size constraints in the tilelang codebase. The code under review is a test file with intentional simplifications—allocating a 1D fragment of size M for a random number fill is not unusual for test scenarios. The concern about "large M values" is not substantiated by error handling or documented limits found in the codebase.

Without concrete evidence that fragment buffers have enforced size limits (the allocate.py documentation in the search results does not specify such limits), and given this is a test rather than production code, the review comment represents a best-practice suggestion rather than a critical issue.

tilelang/language/random.py (3)

8-12: LGTM!

The inline PTX assembly for computing the high 32 bits of unsigned multiplication is correct and efficient.


14-44: LGTM!

The Philox algorithm implementation is correct. The constants and key schedule match the standard Philox specification.


92-97: LGTM!

The public API correctly computes the total number of elements and delegates to the implementation macro.

Comment on lines 56 to 75
__device__ void philox_rand_kernel(
float* output,
int total_elems,
uint64_t seed,
int n_rounds
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= total_elems) return;
uint32_t seed_lo = (uint32_t)(seed & 0xFFFFFFFFULL);
uint32_t seed_hi = (uint32_t)((seed >> 32) & 0xFFFFFFFFULL);
uint32_t offset_lo = (uint32_t)idx;
uint32_t offset_hi = 0U;
uint32_t c0 = offset_lo;
uint32_t c1 = offset_hi;
uint32_t c2 = 0U;
uint32_t c3 = 0U;
philox_impl_device(&c0, &c1, &c2, &c3, seed_lo, seed_hi, n_rounds);
output[idx] = uint32_to_uniform_float_device(c0);
}

⚠️ Potential issue | 🟡 Minor

Consider efficiency improvement and note dtype limitation.

Two observations:

  1. Efficiency: The Philox algorithm generates 4 random uint32 values (c0, c1, c2, c3) per invocation, but only c0 is used (line 74). For better throughput, consider generating 4 output values per thread.

  2. Hardcoded dtype: The kernel signature uses float* output (line 57), which limits the implementation to float32. This conflicts with the dtype parameter passed to _rand_parallel_impl (line 80), which suggests multi-dtype support.

For the efficiency improvement, you could modify the kernel to write 4 consecutive values:

-    output[idx] = uint32_to_uniform_float_device(c0);
+    int base_idx = idx * 4;
+    if (base_idx < total_elems) output[base_idx] = uint32_to_uniform_float_device(c0);
+    if (base_idx + 1 < total_elems) output[base_idx + 1] = uint32_to_uniform_float_device(c1);
+    if (base_idx + 2 < total_elems) output[base_idx + 2] = uint32_to_uniform_float_device(c2);
+    if (base_idx + 3 < total_elems) output[base_idx + 3] = uint32_to_uniform_float_device(c3);

Note: This would require adjusting the grid/block dimensions accordingly.
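
If this route is taken, the launch math changes correspondingly; a hypothetical host-side sketch (names illustrative, not from the PR):

def launch_dims(total_elems: int, threads_per_block: int = 256, outputs_per_thread: int = 4):
    # Each thread now covers 4 consecutive outputs, so fewer threads are needed.
    num_threads = (total_elems + outputs_per_thread - 1) // outputs_per_thread
    num_blocks = (num_threads + threads_per_block - 1) // threads_per_block
    return num_blocks, threads_per_block

print(launch_dims(1000))  # -> (1, 256): 250 working threads fit in one block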

@coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
tilelang/language/random.py (1)

6-6: Drop or wire through the unused dtype parameter.

dtype never leaves _rand_parallel_impl, so it’s confusing noise and will trigger lint. Either remove it or pass it through to the intrinsic once multi-dtype support lands.

src/tl_templates/cuda/random.h (1)

33-36: Simplify: Remove unnecessary uint64_t casts.

The uint64_t casts in these operations are unnecessary:

  • Lines 33-34: Casting to uint64_t then truncating to uint32_t for the low 32 bits can be replaced with direct multiplication, which naturally truncates.
  • Lines 35-36: Casting to uint64_t for uint32_t addition is unnecessary since uint32_t addition naturally wraps on overflow (which is the intended behavior for Philox).

Apply this diff to simplify the code:

-    c1_val = (uint32_t)((uint64_t)B * (uint64_t)_c2);
-    c3_val = (uint32_t)((uint64_t)A * (uint64_t)_c0);
-    k0_val = (uint32_t)((uint64_t)k0_val + PHILOX_KEY_A);
-    k1_val = (uint32_t)((uint64_t)k1_val + PHILOX_KEY_B);
+    c1_val = B * _c2;
+    c3_val = A * _c0;
+    k0_val += PHILOX_KEY_A;
+    k1_val += PHILOX_KEY_B;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 760536a and d63dfbb.

📒 Files selected for processing (6)
  • src/op/builtin.cc (1 hunks)
  • src/op/builtin.h (1 hunks)
  • src/target/codegen_cuda.cc (2 hunks)
  • src/target/codegen_cuda.h (1 hunks)
  • src/tl_templates/cuda/random.h (1 hunks)
  • tilelang/language/random.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-03T06:24:11.411Z
Learnt from: Rachmanino
Repo: tile-ai/tilelang PR: 1175
File: src/op/math.cc:44-52
Timestamp: 2025-11-03T06:24:11.411Z
Learning: In tilelang's `src/op/math.cc`, the `infinity_op` function uses `std::numeric_limits<float>::infinity()` as a placeholder for all float types (including float64 and bfloat16). The codegen layer (PrintConst:Inf) handles the correct infinity value based on the dtype field of the FloatImm node, so the specific C++ template argument doesn't matter.

Applied to files:

  • src/target/codegen_cuda.cc
🧬 Code graph analysis (2)
src/target/codegen_cuda.cc (1)
src/target/codegen_cuda.h (1)
  • need_random_h_ (126-156)
tilelang/language/random.py (2)
src/op/builtin.h (1)
  • tvm (13-577)
tilelang/language/tir/op.py (2)
  • call_intrin (120-145)
  • address_of (464-480)
🪛 Ruff (0.14.3)
tilelang/language/random.py

6-6: Unused function argument: dtype

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Quick Lint
🔇 Additional comments (2)
src/tl_templates/cuda/random.h (2)

7-11: LGTM: Standard PTX multiply-high implementation.

The inline PTX assembly correctly computes the high 32 bits of a 32×32→64 multiply, which is the standard approach for CUDA device code.


52-68: Kernel logic is correct, pending float conversion fix.

The kernel implementation correctly:

  • Computes per-thread indices with bounds checking
  • Initializes Philox state using seed and thread index as offset
  • Calls the Philox round function
  • Writes the converted output

However, the output correctness depends on fixing the uint32_to_uniform_float_device function flagged above. Once that conversion is corrected, this kernel will produce properly distributed random floats.

Comment on lines +44 to +50
TL_DEVICE float uint32_to_uniform_float_device(uint32_t x) {
const float scale = 4.6566127342e-10f;
int32_t x_int32;
memcpy(&x_int32, &x, sizeof(uint32_t));
int32_t x_abs = (x_int32 < 0) ? (-x_int32 - 1) : x_int32;
return (float)x_abs * scale;
}

⚠️ Potential issue | 🔴 Critical

Fix incorrect uint32-to-float conversion logic.

The current implementation has a critical logic error that produces incorrect and biased random distributions:

  1. It reinterprets uint32_t as int32_t and takes an absolute value, which is incorrect for values ≥ 2³¹
  2. The scale factor 4.6566127342e-10f ≈ 1/2³¹ instead of the correct 1/2³² ≈ 2.3283064365e-10f
  3. This causes uint32 values in [2³¹, 2³²) to wrap around and map back to [0, 1), creating significant bias

For example:

  • 0x80000000 → interpreted as -2147483648 → abs gives 2147483647 → ~1.0
  • 0xFFFFFFFF → interpreted as -1 → abs gives 0 → ~0.0

The standard conversion for uint32 to uniform float [0, 1) is simply: x * 2.3283064365386963e-10f

Apply this diff to fix the conversion:

 TL_DEVICE float uint32_to_uniform_float_device(uint32_t x) {
-  const float scale = 4.6566127342e-10f;
-  int32_t x_int32;
-  memcpy(&x_int32, &x, sizeof(uint32_t));
-  int32_t x_abs = (x_int32 < 0) ? (-x_int32 - 1) : x_int32;
-  return (float)x_abs * scale;
+  return x * 2.3283064365386963e-10f;
 }
🤖 Prompt for AI Agents
In src/tl_templates/cuda/random.h around lines 44 to 50, the uint32-to-float
conversion wrongly reinterprets the uint32 as int32, takes an abs (causing
wrap/bias for values >= 2^31) and uses the wrong scale; replace the body so it
does not cast/reinterpret to signed, remove memcpy and abs, and return the
uint32 multiplied by the correct 1/2^32 scale constant (2.3283064365386963e-10f)
cast to float so the function computes a uniform float in [0,1) without bias.

Comment on lines 7 to 14
T.call_intrin(
"handle",
tir.op.Op.get("tl.philox_rand"),
T.address_of(buffer[0]),
total_elems,
seed,
n_rounds,
)

⚠️ Potential issue | 🔴 Critical

Wrap the intrinsic in T.evaluate.

As written, T.call_intrin just builds a PrimExpr. Without T.evaluate(...) the call never gets emitted into the TIR, so the random kernel is never invoked.

Apply this diff:

-    T.call_intrin(
-        "handle",
-        tir.op.Op.get("tl.philox_rand"),
-        T.address_of(buffer[0]),
-        total_elems,
-        seed,
-        n_rounds,
-    )
+    T.evaluate(
+        T.call_intrin(
+            "handle",
+            tir.op.Op.get("tl.philox_rand"),
+            T.address_of(buffer[0]),
+            total_elems,
+            seed,
+            n_rounds,
+        )
+    )
🤖 Prompt for AI Agents
In tilelang/language/random.py around lines 7 to 14, the T.call_intrin
invocation builds a PrimExpr but is never emitted; wrap the intrinsic call in
T.evaluate so the call becomes a side-effecting statement in the TIR. Replace
the bare T.call_intrin(...) expression with T.evaluate(T.call_intrin(...)) (or
assign the result to a throwaway var via T.evaluate) so the random kernel is
actually invoked at codegen time.

T.call_intrin(
"handle",
tir.op.Op.get("tl.philox_rand"),
T.address_of(buffer[0]),

⚠️ Potential issue | 🔴 Critical

Index all buffer dimensions when taking the address.

buffer[0] only works for 1D buffers; multi-dimensional buffers require one index per dimension, otherwise script lowering fails. Emit a zero-tuple so the API works for general shapes.

Apply this diff after the T.evaluate change:

-            T.address_of(buffer[0]),
+            T.address_of(buffer[tuple(0 for _ in buffer.shape)]),
🤖 Prompt for AI Agents
In tilelang/language/random.py around line 10, the code takes
address_of(buffer[0]) which only indexes 1D buffers; change it to index the
buffer with an empty/zero-length index tuple so all dimensions are handled
(construct an empty index tuple and use it to index the buffer before calling
T.address_of), applying this after the T.evaluate change so lowering works for
multi-dimensional buffers.

- but 2d has some problems
@coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
testing/python/language/test_tilelang_language_rand.py (2)

23-28: Address the unused parameter and mask logic issues identified in the previous review.

As noted in the previous review, this function has two issues:

  1. The dtype parameter is declared but never used
  2. The mask logic offset < M is misleading for multi-block scenarios

Please refer to the previous review comment for the suggested fix.


31-40: Fix dtype inconsistency and address issues from the previous review.

In addition to the issues identified in the previous review (unused tensor allocation, incorrect tl.constexpr usage), there's a dtype inconsistency:

  • Line 33 explicitly uses torch.float32 for the TileLang result (though it's immediately reassigned)
  • Line 36 uses getattr(torch, dtype) for the Triton result

While line 34 reassigns tilelang_result, this inconsistency could be confusing. Ensure both implementations use the same dtype.

Apply this diff to address all issues:

 def run_rand_1d(M=1024, seed=42, dtype="float32", device="cuda"):
     tilelang_kernel = tilelang_rand_1d(M=M, seed=seed, dtype=dtype)
-    tilelang_result = torch.empty(M, dtype=torch.float32, device=device)
     tilelang_result = tilelang_kernel()

     triton_result = torch.empty(M, dtype=getattr(torch, dtype), device=device)
     grid = (1,)
-    BLOCK = tl.constexpr(M)
-    triton_rand_1d[grid](triton_result, BLOCK, seed=seed, dtype=getattr(tl, dtype))
+    triton_rand_1d[grid](triton_result, M, seed=seed, TOTAL_SIZE=M)
     torch.testing.assert_close(tilelang_result, triton_result, atol=1e-3, rtol=1e-3)
🧹 Nitpick comments (1)
testing/python/language/test_tilelang_language_rand.py (1)

14-18: Consider simplifying the indexing for the single-block kernel.

Since the kernel uses only 1 block (line 14), bx will always be 0, making the expression bx * M + i equivalent to just i. For clarity, you can simplify the indexing.

     @T.prim_func
     def rand_kernel(A: T.Tensor((M,), dtype)):
         with T.Kernel(1, threads=1024) as bx:
             rand_buffer = T.alloc_fragment((M,), dtype)
             T.rand(rand_buffer, seed)
             for i in T.Parallel(M):
-                A[bx * M + i] = rand_buffer[i]
+                A[i] = rand_buffer[i]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d63dfbb and 2a9dcec.

📒 Files selected for processing (2)
  • src/tl_templates/cuda/random.h (1 hunks)
  • testing/python/language/test_tilelang_language_rand.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/tl_templates/cuda/random.h
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_rand.py (6)
tilelang/language/allocate.py (2)
  • alloc_fragment (59-70)
  • alloc_shared (27-42)
tilelang/language/random.py (1)
  • rand (17-22)
tilelang/language/parallel.py (1)
  • Parallel (9-29)
tilelang/transform/pass_config.py (1)
  • PassConfigKey (6-144)
tilelang/language/copy.py (1)
  • copy (11-87)
tilelang/language/tir/op.py (1)
  • if_then_else (3014-3044)
🪛 Ruff (0.14.3)
testing/python/language/test_tilelang_language_rand.py

24-24: Unused function argument: dtype

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
🔇 Additional comments (2)
testing/python/language/test_tilelang_language_rand.py (2)

96-109: Note: Different block sizes between Triton and TileLang implementations.

The Triton kernel uses BLOCK_SIZE=1024 (line 104), while the TileLang kernel uses block_m=64 (line 71 in tilelang_seeded_dropout). While this shouldn't affect correctness if both implementations are correct, it's worth noting that they process data in different granularities.

Additionally, this test will likely fail due to the seed handling issue identified in the tilelang_seeded_dropout function, where all blocks use the same seed.


112-119: LGTM: Test structure is appropriate.

The test function is properly decorated with @tilelang.testing.requires_cuda and calls both test runners. The main guard follows standard Python conventions.

Note that the test will need the issues identified in earlier comments to be resolved before it can pass reliably.

output_buffer = T.alloc_fragment((block_m,), dtype)

T.copy(A[block_start], A_shared)
T.rand(rand_buffer, seed)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Critical: Same seed used across all blocks produces identical random sequences.

Line 82 calls T.rand(rand_buffer, seed) with the same seed in every block. Unlike the Triton reference (line 58), which passes element-specific offsets to ensure unique random numbers, this implementation will generate the same random sequence in each block. This breaks the dropout functionality, as elements in different blocks at the same local offset will have identical dropout masks.

Verify whether the TileLang rand implementation automatically accounts for block/thread IDs to generate unique sequences:


🏁 Script executed:

#!/bin/bash
# Check if _rand_parallel_impl uses block/thread context for offset

# Search for the implementation of _rand_parallel_impl
ast-grep --pattern 'def _rand_parallel_impl($$$)'

# Also search for how seed/offset is handled in random.py
rg -n -A 10 '_rand_parallel_impl'

Length of output: 1226



---

🏁 Script executed:

#!/bin/bash
# Find the definition of T.rand function

# Search in language module for rand function
rg -n 'def rand' --type=py

# Also check if it's an alias or imported
rg -n 'rand\s*=' --type=py -A 2 -B 2

# Search for rand_buffer allocation in test file
rg -n 'rand_buffer' --type=py -B 5 -A 5

Length of output: 3787


🏁 Script executed:

#!/bin/bash
# Also search more broadly for rand in the tilelang module
rg -n '\bT\.rand' --type=py -C 3

Length of output: 1327



---

🏁 Script executed:

#!/bin/bash
# Get the full definition of T.rand including docstring
rg -n 'def rand' --type=py -A 15 tilelang/language/random.py

Length of output: 209


Critical: Same seed used across all blocks produces identical random sequences.

Line 82 calls T.rand(rand_buffer, seed) with the same seed in every block. The T.rand function has no offset parameter to differentiate between blocks. Unlike the Triton reference (lines 23-28), which passes element-specific offsets to ensure unique random numbers, all blocks will generate identical random sequences. This breaks dropout functionality, as elements in different blocks at the same local offset will have identical dropout masks.

To fix: Modify T.rand to accept an offset parameter (like Triton), or compute different seeds/offsets per block using block IDs before calling T.rand.

🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_rand.py around line 82,
calling T.rand(rand_buffer, seed) with the same seed in every block produces
identical random sequences across blocks; change the call so each block uses a
distinct offset/seed: either update T.rand to accept an additional offset
parameter (like Triton) and pass a per-element/per-block offset computed from
block IDs, or compute a unique seed per block (e.g., seed + block_id or seed ^
block_id) before calling T.rand; ensure the offset/seed computation uses the
thread/block indices available in the kernel so different blocks generate
independent random streams.
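
One possible shape of the per-block fix, sketched against the dropout kernel quoted above (hypothetical: it assumes T.rand accepts an arbitrary integer expression as the seed, which should be verified; M, block_m, and seed are closure variables as in the test):

with T.Kernel(T.ceildiv(M, block_m), threads=128) as bx:
    rand_buffer = T.alloc_fragment((block_m,), "float32")
    # Derive a distinct stream per block so dropout masks do not repeat.
    T.rand(rand_buffer, seed + bx)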

@Cunxiao2002 (Contributor, Author) commented:

It seems there are still some issues when the buffer is 2D...

@coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (4)
tilelang/language/random.py (1)

13-21: Emit the intrinsic and supply a writable pointer.

Line 13: Without wrapping the T.call_intrin in T.evaluate, the call is dropped during lowering, so the Philox generator never runs. Line 16: buffer.access_ptr("r") returns a read-only pointer, which cannot legally be passed to a float* parameter and will fail compilation once CUDA codegen instantiates the call. Please wrap the intrinsic in T.evaluate and request a read/write pointer before passing it to the device helper.

-    T.call_intrin(
-        "handle",
-        tir.op.Op.get("tl.philox_rand"),
-        buffer.access_ptr("r"),
-        total_elems,
-        block_m,
-        block_n,
-        seed,
-        n_rounds,
-    )
+    T.evaluate(
+        T.call_intrin(
+            "handle",
+            tir.op.Op.get("tl.philox_rand"),
+            buffer.access_ptr("rw"),
+            total_elems,
+            block_m,
+            block_n,
+            seed,
+            n_rounds,
+        )
+    )
src/tl_templates/cuda/random.h (1)

44-49: Correct the uint32→float conversion.

Line 44: Reinterpreting the value as signed, taking an absolute value, and scaling by 1/2³¹ both introduces UB (on INT32_MIN) and produces a biased distribution. The canonical mapping to [0, 1) is simply x * 2.3283064365386963e-10f.

 TL_DEVICE float uint32_to_uniform_float_device(uint32_t x) {
-  const float scale = 4.6566127342e-10f;
-  int32_t x_int32;
-  memcpy(&x_int32, &x, sizeof(uint32_t));
-  int32_t x_abs = (x_int32 < 0) ? (-x_int32 - 1) : x_int32;
-  return (float)x_abs * scale;
+  return x * 2.3283064365386963e-10f;
 }
testing/python/language/test_tilelang_language_rand.py (2)

24-28: Fix Triton mask and drop the unused dtype parameter.

Lines 24-28: Using mask=offset < M only covers the first block; later blocks never store anything. The dtype kernel argument is also unused. Please replace the mask with a check against the overall element count and remove the dead parameter.

-@triton.jit
-def triton_rand_1d(X, M, seed, dtype: tl.constexpr):
+@triton.jit
+def triton_rand_1d(X, M, seed, total_size):
     pid = tl.program_id(0)
     offset = pid * M + tl.arange(0, M)
     rand = tl.rand(seed, offset)
-    tl.store(X + offset, rand, mask=offset < M)
+    tl.store(X + offset, rand, mask=offset < total_size)

31-39: Update the Triton call site to pass the total size.

Lines 31-39: After fixing the Triton kernel signature, the launch still passes the old arguments. Pass the total element count and drop the redundant tl.constexpr wrapper.

-    triton_result = torch.empty(M, dtype=getattr(torch, dtype), device=device)
-    grid = (1,)
-    BLOCK = tl.constexpr(M)
-    triton_rand_1d[grid](triton_result, BLOCK, seed=seed, dtype=getattr(tl, dtype))
+    triton_result = torch.empty(M, dtype=getattr(torch, dtype), device=device)
+    grid = (1,)
+    triton_rand_1d[grid](triton_result, M, seed=seed, total_size=M)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a9dcec and ebd64c3.

📒 Files selected for processing (3)
  • src/tl_templates/cuda/random.h (1 hunks)
  • testing/python/language/test_tilelang_language_rand.py (1 hunks)
  • tilelang/language/random.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
testing/python/language/test_tilelang_language_rand.py (4)
src/tl_templates/cuda/random.h (1)
  • tl (5-86)
tilelang/language/allocate.py (2)
  • alloc_fragment (59-70)
  • alloc_shared (27-42)
tilelang/language/random.py (1)
  • rand (25-40)
tilelang/language/tir/op.py (1)
  • if_then_else (3014-3044)
tilelang/language/random.py (2)
src/op/builtin.h (1)
  • tvm (13-577)
tilelang/language/tir/op.py (1)
  • call_intrin (120-145)
🪛 Ruff (0.14.3)
testing/python/language/test_tilelang_language_rand.py

24-24: Unused function argument: dtype

(ARG001)


114-114: Unused function argument: dtype

(ARG001)

tilelang/language/random.py

12-12: Unused function argument: dtype

(ARG001)


37-38: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)

Comment on lines 65 to 83
int elems_per_thread = (total_elems + num_threads - 1) / num_threads;

for (int i = 0; i < elems_per_thread; i++) {
int local_linear_idx = tid * elems_per_thread + i;
int local_row = local_linear_idx / block_n;
int local_col = local_linear_idx % block_n;

int global_row = block_row_offset + local_row;
int global_col = block_col_offset + local_col;
uint64_t global_idx = (uint64_t)global_row * N + global_col;

uint32_t c0 = (uint32_t)(global_idx & 0xFFFFFFFFULL);
uint32_t c1 = (uint32_t)((global_idx >> 32) & 0xFFFFFFFFULL);
uint32_t c2 = 0U;
uint32_t c3 = 0U;

philox_impl_device(&c0, &c1, &c2, &c3, seed_lo, seed_hi, n_rounds);
output[i] = uint32_to_uniform_float_device(c0);
}

⚠️ Potential issue | 🔴 Critical

Fix overlapping writes and guard the tail.

Lines 65-83: Every thread writes to output[i], so threads clobber each other’s results. When total_elems < blockDim.x * elems_per_thread, local_linear_idx also exceeds the valid range, causing out-of-bounds writes. Guard the loop with local_linear_idx < total_elems and index the buffer with local_linear_idx.

-  for (int i = 0; i < elems_per_thread; i++) {
-    int local_linear_idx = tid * elems_per_thread + i;
+  for (int i = 0; i < elems_per_thread; i++) {
+    int local_linear_idx = tid * elems_per_thread + i;
+    if (local_linear_idx >= total_elems) {
+      break;
+    }
     int local_row = local_linear_idx / block_n;
     int local_col = local_linear_idx % block_n;
 
     int global_row = block_row_offset + local_row;
     int global_col = block_col_offset + local_col;
     uint64_t global_idx = (uint64_t)global_row * N + global_col;
 
     uint32_t c0 = (uint32_t)(global_idx & 0xFFFFFFFFULL);
     uint32_t c1 = (uint32_t)((global_idx >> 32) & 0xFFFFFFFFULL);
     uint32_t c2 = 0U;
     uint32_t c3 = 0U;
 
     philox_impl_device(&c0, &c1, &c2, &c3, seed_lo, seed_hi, n_rounds);
-    output[i] = uint32_to_uniform_float_device(c0);
+    output[local_linear_idx] = uint32_to_uniform_float_device(c0);
   }
🤖 Prompt for AI Agents
In src/tl_templates/cuda/random.h around lines 65 to 83, threads currently write
to output[i] unguarded which causes overlapping writes and potential
out-of-bounds when local_linear_idx >= total_elems; modify the loop to check
that local_linear_idx < total_elems before doing work and write into
output[local_linear_idx] (not output[i]) so each thread writes to the correct
global slot and the tail is safely guarded.

Comment on lines 81 to 91
T.copy(A[block_start], A_shared)
T.rand(rand_buffer, seed)

for i in T.Parallel(block_m):
output_buffer[i] = T.if_then_else(
rand_buffer[i] > p,
A_shared[i] / (1.0 - p),
0.0,
)

T.copy(output_buffer, B[block_start])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Guard loads/stores in the last dropout block.

Lines 81-91: T.copy(A[block_start], A_shared) and T.copy(output_buffer, B[block_start]) always move block_m elements. When M % block_m != 0, the final block reads and writes past the end of A/B, leading to undefined behaviour. Please compute the remaining element count (e.g. valid = T.min(block_m, M - block_start)) and guard both copies and the parallel loop so the tail block only touches valid indices.

🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_rand.py around lines 81 to 91,
the copies and the parallel loop always move block_m elements which overruns A/B
when M % block_m != 0; compute valid = T.min(block_m, M - block_start) and use
that to bound all tail operations: copy only valid elements from A to A_shared
and from output_buffer back to B, and change the T.Parallel loop to iterate over
valid instead of block_m so rand_buffer/output_buffer and the conditional write
only touch indices < valid.

Comment on lines +138 to +142
with T.Kernel(T.ceildiv(M, block_m), T.ceildiv(N, block_n), threads=64) as (bx, by):
rand_buffer = T.alloc_fragment((block_m, block_n), dtype)
T.rand(rand_buffer, seed)
T.copy(rand_buffer, A[bx * block_m, by * block_n])


⚠️ Potential issue | 🔴 Critical

Handle partial tiles in the 2D generator.

Lines 138-142: T.copy(rand_buffer, A[bx * block_m, by * block_n]) blindly writes block_m * block_n elements. For matrices whose dimensions aren’t multiples of block_m/block_n, the last tiles run out of bounds. Add per-dimension guards (similar to the Triton mask) or clamp the copy extents to the remaining rows/columns before issuing the copy.
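
A guarded elementwise store is one way to express that (a sketch only, replacing the final T.copy in the kernel quoted above; it assumes per-element conditionals are supported inside T.Parallel):

for i, j in T.Parallel(block_m, block_n):
    gi = bx * block_m + i
    gj = by * block_n + j
    if gi < M and gj < N:  # clamp the tail tiles
        A[gi, gj] = rand_buffer[i, j]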

@LeiWang1999 (Member) commented:

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 344 to 346
TIR_DEFINE_TL_BUILTIN(philox_rand)
.set_num_inputs(4)
.set_attr<TCallEffectKind>("TCallEffectKind",


P1: Register philox_rand with correct input count

The new intrinsic is registered with set_num_inputs(4) even though every caller passes six arguments (buffer pointer, total_elems, block_m, block_n, seed, n_rounds) and the CUDA helper tl::philox_rand also takes six parameters. TVM’s intrinsic registry enforces num_inputs, so this mismatch will trigger a check failure when T.rand is lowered, making the feature unusable. The intrinsic should accept six inputs (or -1 for variadic) to match the call sites.


Comment on lines 65 to 82
int elems_per_thread = (total_elems + num_threads - 1) / num_threads;

for (int i = 0; i < elems_per_thread; i++) {
int local_linear_idx = tid * elems_per_thread + i;
int local_row = local_linear_idx / block_n;
int local_col = local_linear_idx % block_n;

int global_row = block_row_offset + local_row;
int global_col = block_col_offset + local_col;
uint64_t global_idx = (uint64_t)global_row * N + global_col;

uint32_t c0 = (uint32_t)(global_idx & 0xFFFFFFFFULL);
uint32_t c1 = (uint32_t)((global_idx >> 32) & 0xFFFFFFFFULL);
uint32_t c2 = 0U;
uint32_t c3 = 0U;

philox_impl_device(&c0, &c1, &c2, &c3, seed_lo, seed_hi, n_rounds);
output[i] = uint32_to_uniform_float_device(c0);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Store Philox results into unique buffer indices

Inside the CUDA helper philox_rand each thread computes local_linear_idx for the element it owns, but the generated random value is written to output[i] instead of output[local_linear_idx]. When more than one thread participates this causes every thread to overwrite the same first few positions, leaving most of the buffer uninitialized and yielding repeated numbers when copied back to global memory. It can also write past the end of the buffer when total_elems is not a multiple of the thread count. The output pointer should be indexed by local_linear_idx (with a bounds check) so each thread writes to its own slice.


@Cunxiao2002 (Contributor, Author) replied:

Thanks! I will take a look.
